Conversation

0cc4m (Collaborator) commented Jul 27, 2025

Here's an initial version of an Integer Dot mul_mat_vec shader. So far it seems to improve performance with q4_1 and q5_1, but reduce it with q4_0, q5_0 and q8_0. My guess is that this is because q4_1 and q5_1 allow 32-bit loads, while the rest have to fall back to 16-bit loads.

@jeffbolznv Would you mind taking a look and letting me know if I have any obvious performance issues in the shader?
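
For anyone skimming the thread, here is a minimal C++ sketch (not the PR's GLSL; struct and function names are illustrative) of the kind of per-block math this path relies on for a q4_1 weight block against a q8_1-quantized activation block. The packed 4×int8 multiply-add is written out explicitly; that inner loop is what the hardware integer dot instructions exposed via GL_EXT_integer_dot_product collapse into a single operation.

```cpp
#include <cstdint>
#include <cstring>

// Emulates a dp4a-style instruction: dot of four int8 lanes packed into two
// 32-bit words, accumulated into an int32.
static int32_t dot4x8(uint32_t a, uint32_t b, int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        acc += int32_t(int8_t(a >> (8 * i))) * int32_t(int8_t(b >> (8 * i)));
    }
    return acc;
}

// One q4_1 weight block (32 values): x_i = d * q_i + m, q_i in [0, 15].
struct BlockQ4_1 { float d, m; uint8_t qs[16]; };   // d/m are fp16 in the real format
// One q8_1 activation block (32 values): y_i = d * q_i, s = d * sum(q_i).
struct BlockQ8_1 { float d, s; int8_t qs[32]; };    // d/s are fp16 in the real format

// Integer-dot product of one q4_1 block with one q8_1 block:
//   sum_i (d*q_i + m) * (d8*q8_i) = d*d8 * sum_i(q_i*q8_i) + m * s
float block_dot_q4_1_q8_1(const BlockQ4_1& a, const BlockQ8_1& b) {
    int32_t sumi = 0;
    for (int j = 0; j < 16; j += 4) {
        // Low nibbles hold weights 0..15, high nibbles hold weights 16..31.
        uint32_t lo = 0, hi = 0;
        for (int k = 0; k < 4; ++k) {
            lo |= uint32_t(a.qs[j + k] & 0x0F) << (8 * k);
            hi |= uint32_t(a.qs[j + k] >> 4)   << (8 * k);
        }
        uint32_t b_lo, b_hi;
        std::memcpy(&b_lo, &b.qs[j],      4);
        std::memcpy(&b_hi, &b.qs[j + 16], 4);
        sumi = dot4x8(lo, b_lo, sumi);
        sumi = dot4x8(hi, b_hi, sumi);
    }
    return a.d * b.d * float(sumi) + a.m * b.s;
}
```

The appeal is that all 32 multiplies stay in integer registers and only two float multiply-adds per block remain at the end.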

github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jul 27, 2025
0cc4m (Collaborator, Author) commented Jul 27, 2025

Here are performance results from my tests:

AMD Radeon Pro VII
ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

Master:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 3720 runs -   326.01 us/run - 134.48 MFLOP/run - 412.51 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 3720 runs -   274.52 us/run - 134.48 MFLOP/run - 489.87 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11076 runs -    95.15 us/run - 117.44 MFLOP/run -   1.23 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9372 runs -   114.44 us/run - 117.44 MFLOP/run -   1.03 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7668 runs -   136.38 us/run - 117.44 MFLOP/run - 861.11 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   149.87 us/run - 117.44 MFLOP/run - 783.61 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   150.03 us/run - 117.44 MFLOP/run - 782.80 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   121.87 us/run - 234.88 MFLOP/run -   1.93 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   181.40 us/run - 234.88 MFLOP/run -   1.29 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6390 runs -   166.30 us/run - 234.88 MFLOP/run -   1.41 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   206.09 us/run - 234.88 MFLOP/run -   1.14 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   196.76 us/run - 234.88 MFLOP/run -   1.19 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   150.56 us/run - 352.32 MFLOP/run -   2.34 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4544 runs -   229.63 us/run - 352.32 MFLOP/run -   1.53 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5396 runs -   189.94 us/run - 352.32 MFLOP/run -   1.85 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3976 runs -   259.13 us/run - 352.32 MFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3976 runs -   258.81 us/run - 352.32 MFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   186.43 us/run - 469.76 MFLOP/run -   2.52 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3621 runs -   278.23 us/run - 469.76 MFLOP/run -   1.69 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4686 runs -   218.20 us/run - 469.76 MFLOP/run -   2.15 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3408 runs -   307.29 us/run - 469.76 MFLOP/run -   1.53 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2769 runs -   382.97 us/run - 469.76 MFLOP/run -   1.23 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4617 runs -   224.90 us/run - 587.20 MFLOP/run -   2.61 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3078 runs -   330.95 us/run - 587.20 MFLOP/run -   1.77 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4104 runs -   250.29 us/run - 587.20 MFLOP/run -   2.35 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2907 runs -   365.23 us/run - 587.20 MFLOP/run -   1.61 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2223 runs -   452.07 us/run - 587.20 MFLOP/run -   1.30 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2996 runs -   337.45 us/run - 939.52 MFLOP/run -   2.78 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1498 runs -   682.41 us/run - 939.52 MFLOP/run -   1.38 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2996 runs -   335.38 us/run - 939.52 MFLOP/run -   2.80 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1391 runs -   725.50 us/run - 939.52 MFLOP/run -   1.30 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1498 runs -   677.66 us/run - 939.52 MFLOP/run -   1.39 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      136 runs -  7371.35 us/run -  60.13 GFLOP/run -   8.16 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      130 runs -  7697.38 us/run -  60.13 GFLOP/run -   7.81 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      132 runs -  7584.95 us/run -  60.13 GFLOP/run -   7.93 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      128 runs -  7931.54 us/run -  60.13 GFLOP/run -   7.58 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      126 runs -  8015.00 us/run -  60.13 GFLOP/run -   7.50 TFLOPS

PR:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 3720 runs -   326.21 us/run - 134.48 MFLOP/run - 412.25 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 3720 runs -   274.08 us/run - 134.48 MFLOP/run - 490.66 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   129.72 us/run - 117.44 MFLOP/run - 905.32 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16188 runs -    62.43 us/run - 117.44 MFLOP/run -   1.88 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   155.69 us/run - 117.44 MFLOP/run - 754.32 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    83.28 us/run - 117.44 MFLOP/run -   1.41 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   216.83 us/run - 117.44 MFLOP/run - 541.62 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6390 runs -   165.83 us/run - 234.88 MFLOP/run -   1.42 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    70.15 us/run - 234.88 MFLOP/run -   3.35 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   200.41 us/run - 234.88 MFLOP/run -   1.17 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11076 runs -    92.60 us/run - 234.88 MFLOP/run -   2.54 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4686 runs -   232.55 us/run - 234.88 MFLOP/run -   1.01 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   150.32 us/run - 352.32 MFLOP/run -   2.34 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11360 runs -    89.56 us/run - 352.32 MFLOP/run -   3.93 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   196.72 us/run - 352.32 MFLOP/run -   1.79 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9088 runs -   111.35 us/run - 352.32 MFLOP/run -   3.16 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3976 runs -   254.72 us/run - 352.32 MFLOP/run -   1.38 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5751 runs -   175.38 us/run - 469.76 MFLOP/run -   2.68 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8733 runs -   115.33 us/run - 469.76 MFLOP/run -   4.07 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4899 runs -   206.11 us/run - 469.76 MFLOP/run -   2.28 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7668 runs -   133.48 us/run - 469.76 MFLOP/run -   3.52 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3834 runs -   267.06 us/run - 469.76 MFLOP/run -   1.76 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5130 runs -   199.10 us/run - 587.20 MFLOP/run -   2.95 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6840 runs -   147.29 us/run - 587.20 MFLOP/run -   3.99 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4446 runs -   228.99 us/run - 587.20 MFLOP/run -   2.56 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5472 runs -   186.59 us/run - 587.20 MFLOP/run -   3.15 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3420 runs -   296.54 us/run - 587.20 MFLOP/run -   1.98 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4922 runs -   205.31 us/run - 939.52 MFLOP/run -   4.58 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7276 runs -   138.46 us/run - 939.52 MFLOP/run -   6.79 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4173 runs -   245.35 us/run - 939.52 MFLOP/run -   3.83 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6313 runs -   160.81 us/run - 939.52 MFLOP/run -   5.84 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3210 runs -   318.22 us/run - 939.52 MFLOP/run -   2.95 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      136 runs -  7386.12 us/run -  60.13 GFLOP/run -   8.14 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      130 runs -  7693.49 us/run -  60.13 GFLOP/run -   7.82 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      132 runs -  7594.42 us/run -  60.13 GFLOP/run -   7.92 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      128 runs -  7918.03 us/run -  60.13 GFLOP/run -   7.59 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      126 runs -  8004.06 us/run -  60.13 GFLOP/run -   7.51 TFLOPS
Intel A770
ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

Master:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 9672 runs -   106.14 us/run - 134.48 MFLOP/run -   1.27 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 3720 runs -   297.67 us/run - 134.48 MFLOP/run - 451.77 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   147.62 us/run - 117.44 MFLOP/run - 795.55 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   158.42 us/run - 117.44 MFLOP/run - 741.31 GFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2556 runs -   559.94 us/run - 117.44 MFLOP/run - 209.74 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   198.08 us/run - 117.44 MFLOP/run - 592.89 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1704 runs -   816.05 us/run - 117.44 MFLOP/run - 143.91 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   155.66 us/run - 234.88 MFLOP/run -   1.51 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   185.73 us/run - 234.88 MFLOP/run -   1.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2130 runs -   483.76 us/run - 234.88 MFLOP/run - 485.54 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   201.83 us/run - 234.88 MFLOP/run -   1.16 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1278 runs -   953.98 us/run - 234.88 MFLOP/run - 246.21 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6248 runs -   165.98 us/run - 352.32 MFLOP/run -   2.12 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4828 runs -   210.20 us/run - 352.32 MFLOP/run -   1.68 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1988 runs -   513.99 us/run - 352.32 MFLOP/run - 685.46 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4828 runs -   218.03 us/run - 352.32 MFLOP/run -   1.62 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1704 runs -   648.93 us/run - 352.32 MFLOP/run - 542.93 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   186.04 us/run - 469.76 MFLOP/run -   2.53 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3834 runs -   265.17 us/run - 469.76 MFLOP/run -   1.77 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2130 runs -   505.40 us/run - 469.76 MFLOP/run - 929.49 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4047 runs -   258.71 us/run - 469.76 MFLOP/run -   1.82 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1491 runs -   673.07 us/run - 469.76 MFLOP/run - 697.94 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3249 runs -   308.76 us/run - 587.20 MFLOP/run -   1.90 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2223 runs -   465.28 us/run - 587.20 MFLOP/run -   1.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1710 runs -   619.83 us/run - 587.20 MFLOP/run - 947.36 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2223 runs -   477.48 us/run - 587.20 MFLOP/run -   1.23 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1197 runs -   931.89 us/run - 587.20 MFLOP/run - 630.12 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3103 runs -   330.52 us/run - 939.52 MFLOP/run -   2.84 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2247 runs -   462.68 us/run - 939.52 MFLOP/run -   2.03 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1712 runs -   589.40 us/run - 939.52 MFLOP/run -   1.59 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2140 runs -   470.27 us/run - 939.52 MFLOP/run -   2.00 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                963 runs -  1085.13 us/run - 939.52 MFLOP/run - 865.81 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      182 runs -  5539.21 us/run -  60.13 GFLOP/run -  10.86 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      184 runs -  5460.43 us/run -  60.13 GFLOP/run -  11.01 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      174 runs -  5796.34 us/run -  60.13 GFLOP/run -  10.37 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      172 runs -  5816.45 us/run -  60.13 GFLOP/run -  10.34 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      160 runs -  6317.52 us/run -  60.13 GFLOP/run -   9.52 TFLOPS

PR:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 9672 runs -   105.39 us/run - 134.48 MFLOP/run -   1.28 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 3720 runs -   300.54 us/run - 134.48 MFLOP/run - 447.46 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   232.85 us/run - 117.44 MFLOP/run - 504.37 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   127.81 us/run - 117.44 MFLOP/run - 918.88 GFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4260 runs -   252.01 us/run - 117.44 MFLOP/run - 466.01 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   153.16 us/run - 117.44 MFLOP/run - 766.79 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4260 runs -   253.84 us/run - 117.44 MFLOP/run - 462.65 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3834 runs -   288.94 us/run - 234.88 MFLOP/run - 812.90 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9372 runs -   110.96 us/run - 234.88 MFLOP/run -   2.12 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3408 runs -   317.45 us/run - 234.88 MFLOP/run - 739.90 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7668 runs -   135.61 us/run - 234.88 MFLOP/run -   1.73 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3834 runs -   264.55 us/run - 234.88 MFLOP/run - 887.85 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3408 runs -   297.55 us/run - 352.32 MFLOP/run -   1.18 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7668 runs -   132.35 us/run - 352.32 MFLOP/run -   2.66 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3124 runs -   339.23 us/run - 352.32 MFLOP/run -   1.04 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6532 runs -   154.97 us/run - 352.32 MFLOP/run -   2.27 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3692 runs -   275.87 us/run - 352.32 MFLOP/run -   1.28 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3195 runs -   316.93 us/run - 469.76 MFLOP/run -   1.48 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   146.76 us/run - 469.76 MFLOP/run -   3.20 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2982 runs -   352.12 us/run - 469.76 MFLOP/run -   1.33 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   181.20 us/run - 469.76 MFLOP/run -   2.59 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3408 runs -   305.57 us/run - 469.76 MFLOP/run -   1.54 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3762 runs -   273.06 us/run - 587.20 MFLOP/run -   2.15 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5643 runs -   179.14 us/run - 587.20 MFLOP/run -   3.28 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2736 runs -   369.60 us/run - 587.20 MFLOP/run -   1.59 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4788 runs -   212.93 us/run - 587.20 MFLOP/run -   2.76 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2907 runs -   361.02 us/run - 587.20 MFLOP/run -   1.63 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2568 runs -   400.11 us/run - 939.52 MFLOP/run -   2.35 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3424 runs -   300.82 us/run - 939.52 MFLOP/run -   3.12 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2354 runs -   435.22 us/run - 939.52 MFLOP/run -   2.16 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2996 runs -   337.42 us/run - 939.52 MFLOP/run -   2.78 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2782 runs -   371.29 us/run - 939.52 MFLOP/run -   2.53 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      182 runs -  5502.12 us/run -  60.13 GFLOP/run -  10.93 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      182 runs -  5522.41 us/run -  60.13 GFLOP/run -  10.89 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      174 runs -  5776.55 us/run -  60.13 GFLOP/run -  10.41 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      166 runs -  6064.83 us/run -  60.13 GFLOP/run -   9.91 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      160 runs -  6308.83 us/run -  60.13 GFLOP/run -   9.53 TFLOPS
Nvidia RTX 3090
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2

Master:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                11160 runs -    94.56 us/run - 134.48 MFLOP/run -   1.42 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 7440 runs -   134.50 us/run - 134.48 MFLOP/run - 999.84 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20448 runs -    49.24 us/run - 117.44 MFLOP/run -   2.38 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    54.12 us/run - 117.44 MFLOP/run -   2.17 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    69.91 us/run - 117.44 MFLOP/run -   1.68 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    70.77 us/run - 117.44 MFLOP/run -   1.66 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    82.06 us/run - 117.44 MFLOP/run -   1.43 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16188 runs -    61.82 us/run - 234.88 MFLOP/run -   3.80 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13206 runs -    77.28 us/run - 234.88 MFLOP/run -   3.04 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12354 runs -    82.16 us/run - 234.88 MFLOP/run -   2.86 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10650 runs -    94.23 us/run - 234.88 MFLOP/run -   2.49 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10650 runs -    95.96 us/run - 234.88 MFLOP/run -   2.45 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13064 runs -    77.12 us/run - 352.32 MFLOP/run -   4.57 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10508 runs -    96.38 us/run - 352.32 MFLOP/run -   3.66 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10792 runs -    94.85 us/run - 352.32 MFLOP/run -   3.71 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9088 runs -   112.82 us/run - 352.32 MFLOP/run -   3.12 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7952 runs -   126.59 us/run - 352.32 MFLOP/run -   2.78 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10863 runs -    93.34 us/run - 469.76 MFLOP/run -   5.03 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8733 runs -   115.35 us/run - 469.76 MFLOP/run -   4.07 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8946 runs -   112.26 us/run - 469.76 MFLOP/run -   4.18 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7455 runs -   136.60 us/run - 469.76 MFLOP/run -   3.44 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6603 runs -   156.48 us/run - 469.76 MFLOP/run -   3.00 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9063 runs -   111.42 us/run - 587.20 MFLOP/run -   5.27 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7353 runs -   138.83 us/run - 587.20 MFLOP/run -   4.23 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7866 runs -   127.26 us/run - 587.20 MFLOP/run -   4.61 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6498 runs -   156.34 us/run - 587.20 MFLOP/run -   3.76 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5472 runs -   185.98 us/run - 587.20 MFLOP/run -   3.16 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6099 runs -   165.53 us/run - 939.52 MFLOP/run -   5.68 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4708 runs -   213.55 us/run - 939.52 MFLOP/run -   4.40 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5671 runs -   179.37 us/run - 939.52 MFLOP/run -   5.24 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4387 runs -   229.11 us/run - 939.52 MFLOP/run -   4.10 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3745 runs -   274.08 us/run - 939.52 MFLOP/run -   3.43 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      904 runs -  1108.01 us/run -  60.13 GFLOP/run -  54.27 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      860 runs -  1164.53 us/run -  60.13 GFLOP/run -  51.63 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      736 runs -  1361.15 us/run -  60.13 GFLOP/run -  44.18 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      736 runs -  1360.98 us/run -  60.13 GFLOP/run -  44.18 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      912 runs -  1097.27 us/run -  60.13 GFLOP/run -  54.80 TFLOPS

PR:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                11160 runs -    94.68 us/run - 134.48 MFLOP/run -   1.42 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 8184 runs -   130.28 us/run - 134.48 MFLOP/run -   1.03 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20448 runs -    50.12 us/run - 117.44 MFLOP/run -   2.34 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21300 runs -    48.13 us/run - 117.44 MFLOP/run -   2.44 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.03 us/run - 117.44 MFLOP/run -   2.10 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.74 us/run - 117.44 MFLOP/run -   2.07 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11928 runs -    86.46 us/run - 117.44 MFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21300 runs -    47.08 us/run - 234.88 MFLOP/run -   4.99 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20448 runs -    49.93 us/run - 234.88 MFLOP/run -   4.70 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17466 runs -    58.08 us/run - 234.88 MFLOP/run -   4.04 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17466 runs -    58.47 us/run - 234.88 MFLOP/run -   4.02 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11502 runs -    88.02 us/run - 234.88 MFLOP/run -   2.67 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19880 runs -    50.74 us/run - 352.32 MFLOP/run -   6.94 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19596 runs -    51.30 us/run - 352.32 MFLOP/run -   6.87 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15904 runs -    63.94 us/run - 352.32 MFLOP/run -   5.51 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16472 runs -    61.01 us/run - 352.32 MFLOP/run -   5.77 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11076 runs -    91.62 us/run - 352.32 MFLOP/run -   3.85 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.33 us/run - 469.76 MFLOP/run -   8.34 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17466 runs -    57.69 us/run - 469.76 MFLOP/run -   8.14 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15123 runs -    66.30 us/run - 469.76 MFLOP/run -   7.09 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15549 runs -    64.62 us/run - 469.76 MFLOP/run -   7.27 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10437 runs -    97.62 us/run - 469.76 MFLOP/run -   4.81 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15732 runs -    63.62 us/run - 587.20 MFLOP/run -   9.23 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16245 runs -    61.62 us/run - 587.20 MFLOP/run -   9.53 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14535 runs -    69.60 us/run - 587.20 MFLOP/run -   8.44 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14535 runs -    69.57 us/run - 587.20 MFLOP/run -   8.44 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9576 runs -   104.78 us/run - 587.20 MFLOP/run -   5.60 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12947 runs -    77.25 us/run - 939.52 MFLOP/run -  12.16 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11877 runs -    84.66 us/run - 939.52 MFLOP/run -  11.10 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11877 runs -    84.27 us/run - 939.52 MFLOP/run -  11.15 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11342 runs -    88.87 us/run - 939.52 MFLOP/run -  10.57 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7597 runs -   133.14 us/run - 939.52 MFLOP/run -   7.06 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      842 runs -  1187.83 us/run -  60.13 GFLOP/run -  50.62 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      784 runs -  1277.27 us/run -  60.13 GFLOP/run -  47.08 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      762 runs -  1313.98 us/run -  60.13 GFLOP/run -  45.76 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      738 runs -  1355.59 us/run -  60.13 GFLOP/run -  44.36 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      924 runs -  1083.58 us/run -  60.13 GFLOP/run -  55.49 TFLOPS
AMD RX 6800 XT
ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared 

Master:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 7440 runs -   145.62 us/run - 134.48 MFLOP/run - 923.47 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                20088 runs -    50.37 us/run - 134.48 MFLOP/run -   2.67 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21300 runs -    47.14 us/run - 117.44 MFLOP/run -   2.49 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    55.37 us/run - 117.44 MFLOP/run -   2.12 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    70.00 us/run - 117.44 MFLOP/run -   1.68 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13632 runs -    74.29 us/run - 117.44 MFLOP/run -   1.58 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17040 runs -    58.72 us/run - 117.44 MFLOP/run -   2.00 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16188 runs -    61.98 us/run - 234.88 MFLOP/run -   3.79 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    78.87 us/run - 234.88 MFLOP/run -   2.98 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11928 runs -    86.15 us/run - 234.88 MFLOP/run -   2.73 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -    98.12 us/run - 234.88 MFLOP/run -   2.39 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11502 runs -    89.74 us/run - 234.88 MFLOP/run -   2.62 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13064 runs -    76.56 us/run - 352.32 MFLOP/run -   4.60 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9940 runs -   102.12 us/run - 352.32 MFLOP/run -   3.45 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -   100.07 us/run - 352.32 MFLOP/run -   3.52 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8236 runs -   123.05 us/run - 352.32 MFLOP/run -   2.86 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8236 runs -   122.62 us/run - 352.32 MFLOP/run -   2.87 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -    99.78 us/run - 469.76 MFLOP/run -   4.71 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   119.36 us/run - 469.76 MFLOP/run -   3.94 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9159 runs -   110.68 us/run - 469.76 MFLOP/run -   4.24 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7242 runs -   139.27 us/run - 469.76 MFLOP/run -   3.37 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5964 runs -   167.74 us/run - 469.76 MFLOP/run -   2.80 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7866 runs -   128.65 us/run - 587.20 MFLOP/run -   4.56 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7011 runs -   144.22 us/run - 587.20 MFLOP/run -   4.07 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6669 runs -   150.20 us/run - 587.20 MFLOP/run -   3.91 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6327 runs -   161.58 us/run - 587.20 MFLOP/run -   3.63 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4788 runs -   211.00 us/run - 587.20 MFLOP/run -   2.78 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5029 runs -   200.80 us/run - 939.52 MFLOP/run -   4.68 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4922 runs -   206.88 us/run - 939.52 MFLOP/run -   4.54 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4280 runs -   233.96 us/run - 939.52 MFLOP/run -   4.02 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4494 runs -   225.62 us/run - 939.52 MFLOP/run -   4.16 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2675 runs -   386.25 us/run - 939.52 MFLOP/run -   2.43 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      348 runs -  2882.03 us/run -  60.13 GFLOP/run -  20.86 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      354 runs -  2837.71 us/run -  60.13 GFLOP/run -  21.19 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      342 runs -  2934.56 us/run -  60.13 GFLOP/run -  20.49 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      336 runs -  2993.35 us/run -  60.13 GFLOP/run -  20.09 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      306 runs -  3282.89 us/run -  60.13 GFLOP/run -  18.32 TFLOPS

PR:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 7440 runs -   142.46 us/run - 134.48 MFLOP/run - 943.97 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                20832 runs -    48.66 us/run - 134.48 MFLOP/run -   2.76 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              29820 runs -    33.86 us/run - 117.44 MFLOP/run -   3.47 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              36636 runs -    27.51 us/run - 117.44 MFLOP/run -   4.27 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              24708 runs -    41.87 us/run - 117.44 MFLOP/run -   2.80 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              29820 runs -    34.24 us/run - 117.44 MFLOP/run -   3.43 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              23004 runs -    44.41 us/run - 117.44 MFLOP/run -   2.64 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21726 runs -    46.39 us/run - 234.88 MFLOP/run -   5.06 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              35358 runs -    28.40 us/run - 234.88 MFLOP/run -   8.27 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    53.68 us/run - 234.88 MFLOP/run -   4.38 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              26412 runs -    38.09 us/run - 234.88 MFLOP/run -   6.17 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20874 runs -    48.58 us/run - 234.88 MFLOP/run -   4.83 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19028 runs -    52.74 us/run - 352.32 MFLOP/run -   6.68 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              24708 runs -    40.71 us/run - 352.32 MFLOP/run -   8.65 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17040 runs -    58.88 us/run - 352.32 MFLOP/run -   5.98 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19880 runs -    50.48 us/run - 352.32 MFLOP/run -   6.98 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18176 runs -    55.56 us/run - 352.32 MFLOP/run -   6.34 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16401 runs -    61.21 us/run - 469.76 MFLOP/run -   7.67 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20874 runs -    48.35 us/run - 469.76 MFLOP/run -   9.72 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14271 runs -    70.92 us/run - 469.76 MFLOP/run -   6.62 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15549 runs -    64.84 us/run - 469.76 MFLOP/run -   7.24 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15762 runs -    63.88 us/run - 469.76 MFLOP/run -   7.35 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16245 runs -    61.77 us/run - 587.20 MFLOP/run -   9.51 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14877 runs -    67.57 us/run - 587.20 MFLOP/run -   8.69 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14022 runs -    71.52 us/run - 587.20 MFLOP/run -   8.21 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12312 runs -    81.39 us/run - 587.20 MFLOP/run -   7.21 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13680 runs -    73.56 us/run - 587.20 MFLOP/run -   7.98 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9844 runs -   102.64 us/run - 939.52 MFLOP/run -   9.15 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11021 runs -    91.60 us/run - 939.52 MFLOP/run -  10.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9202 runs -   108.77 us/run - 939.52 MFLOP/run -   8.64 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9095 runs -   110.57 us/run - 939.52 MFLOP/run -   8.50 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10486 runs -    95.77 us/run - 939.52 MFLOP/run -   9.81 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      362 runs -  2774.96 us/run -  60.13 GFLOP/run -  21.67 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      356 runs -  2815.14 us/run -  60.13 GFLOP/run -  21.36 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      338 runs -  2968.24 us/run -  60.13 GFLOP/run -  20.26 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      326 runs -  3080.20 us/run -  60.13 GFLOP/run -  19.52 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      292 runs -  3442.73 us/run -  60.13 GFLOP/run -  17.47 TFLOPS

jeffbolznv (Collaborator) commented:

I did a quick before/after on some Q4_0 models, and it looks like the quantization is pretty expensive:

master:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128 -r 10 --prio 1 -m c:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\GLM-4-32B-0414-Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        365.51 ± 1.33 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        364.74 ± 3.06 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        236.24 ± 7.06 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        237.61 ± 1.79 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         60.41 ± 0.87 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         60.44 ± 0.15 |

PR:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128 -r 10 --prio 1 -m c:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\GLM-4-32B-0414-Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        340.06 ± 1.73 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        339.06 ± 2.71 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |       224.50 ± 10.15 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        227.18 ± 1.44 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         57.65 ± 0.07 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         57.67 ± 0.11 |

PR with quantize call removed:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128 -r 10 --prio 1 -m c:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\GLM-4-32B-0414-Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        372.26 ± 1.13 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        370.48 ± 3.75 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        242.30 ± 3.98 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        243.00 ± 1.00 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         59.49 ± 0.16 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         59.28 ± 0.14 |

I don't think there's anything particularly wrong with how the quantization is implemented; it's just such a small amount of work that it doesn't fill the GPU, and the 5090 is about the worst case for that. I don't have any great suggestions for what to do about this.
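
For context, the extra cost measured above comes from the quantize pass: before the integer-dot shader runs, each 32-value chunk of the f32 activation vector is converted into a q8_1 block in a separate small dispatch. A C++ sketch of that per-block work (an emulation for illustration, not the actual shader):

```cpp
#include <cmath>
#include <cstdint>

struct BlockQ8_1 { float d, s; int8_t qs[32]; };  // d/s are fp16 in the real format

// Quantize 32 floats into one q8_1 block: d = max|x| / 127, q_i = round(x_i / d),
// and s = d * sum(q_i), which lets the int-dot kernel fold in the q4_1/q5_1 offset.
BlockQ8_1 quantize_block_q8_1(const float* x) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::fmax(amax, std::fabs(x[i]));

    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    BlockQ8_1 out{};
    int32_t sum = 0;
    for (int i = 0; i < 32; ++i) {
        const int32_t q = int32_t(std::lround(x[i] * id));
        out.qs[i] = int8_t(q);
        sum += q;
    }
    out.d = d;
    out.s = d * float(sum);
    return out;
}
```

The arithmetic itself is trivial; the issue described above is that the pass only touches the activation vector, which for a single token is tiny compared to the weight matrix, so it can't keep a large GPU busy.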

0cc4m (Collaborator, Author) commented Jul 28, 2025

Yeah, I also see that. We might have to pick a threshold above which the quantize + integer dot shader path is worth using. Even without further tuning, there are definitely cases where it helps, for example batch sizes 4 and 8 on the RX 6800 XT (a rough sketch of such a gate follows the tables below):

Master:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|--:|-----:|-------:|---------:|-------:|---------:|----:|------:|
| 512 | 512 | 1 | 1024 | 0.372 | 1378.10 | 6.499 | 78.78 | 6.871 | 149.04 |
| 512 | 512 | 2 | 2048 | 0.734 | 1394.93 | 11.341 | 90.29 | 12.075 | 169.60 |
| 512 | 512 | 4 | 4096 | 1.551 | 1320.62 | 18.337 | 111.69 | 19.887 | 205.96 |
| 512 | 512 | 8 | 8192 | 3.499 | 1170.69 | 34.641 | 118.24 | 38.139 | 214.79 |
| 512 | 512 | 16 | 16384 | 8.295 | 987.59 | 59.502 | 137.68 | 67.797 | 241.66 |
| 512 | 512 | 32 | 32768 | 21.548 | 760.35 | 85.820 | 190.91 | 107.368 | 305.19 |

PR:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --: | --: | --: | ---: | -----: | -------: | -----: | -------: | ---: | ----: |
| 512 | 512 | 1 | 1024 | 0.372 | 1376.71 | 6.980 | 73.35 | 7.352 | 139.28 |
| 512 | 512 | 2 | 2048 | 0.721 | 1420.49 | 11.889 | 86.13 | 12.610 | 162.42 |
| 512 | 512 | 4 | 4096 | 1.562 | 1311.47 | 17.186 | 119.17 | 18.747 | 218.49 |
| 512 | 512 | 8 | 8192 | 3.482 | 1176.48 | 29.917 | 136.91 | 33.398 | 245.28 |
| 512 | 512 | 16 | 16384 | 8.253 | 992.55 | 59.530 | 137.61 | 67.783 | 241.71 |
| 512 | 512 | 32 | 32768 | 21.490 | 762.41 | 85.655 | 191.28 | 107.145 | 305.83 |
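
To make the threshold idea concrete, here is a minimal host-side sketch of how such a gate could look (the predicate name, fields, and example value are illustrative assumptions, not the PR's actual dispatch logic):

```cpp
// Hypothetical sketch of a dispatch heuristic: only take the quantize +
// integer-dot mmv path when the batch is large enough to amortize the extra
// quantize pass. Threshold and names are illustrative only.
#include <cstdint>

struct int_dot_caps {
    bool     integer_dot_supported;  // device exposes the integer dot product feature
    uint32_t min_batch_for_int_dot;  // tuned per device/arch, e.g. 4 on RDNA2
};

static bool use_int_dot_mmv(const int_dot_caps & caps, uint32_t n_batch) {
    if (!caps.integer_dot_supported) {
        return false;
    }
    // Below the threshold the existing fp16/fp32 mmv shader wins; above it the
    // quantize cost is spread across enough dot products to pay off.
    return n_batch >= caps.min_batch_for_int_dot;
}
```

The tricky part, as the numbers above show, is that the right threshold differs per device, so it would likely have to be part of the per-device tuning tables.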

@0cc4m 0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec branch from fd8be28 to c19ec8f on August 2, 2025 12:15
@0cc4m
Collaborator Author

0cc4m commented Aug 2, 2025

I implemented the q8_1_x4 blocks that align the q8_1 data to 128 bits. Using them does help a little (there's even an increase for integer dot prompt processing), but the integer dot mmv path is still too slow to enable universally. I'm thinking about ways to use shared memory in the mmv shader, but I'm not sure whether that would help.
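
For illustration, here is a sketch of what such a 128-bit-friendly repacking can look like in struct form (the field order and the x4 grouping are assumptions for this example; the actual block_q8_1_x4 layout in the PR may differ). The point is that a plain q8_1 block is 36 bytes, so back-to-back blocks only guarantee 2-byte alignment, while grouping four blocks and hoisting the (d, s) pairs to the front makes the quant payload start on a 16-byte boundary:

```cpp
// Sketch of a 128-bit-friendly repacking of q8_1 (illustrative layout only).
#include <cstddef>
#include <cstdint>

typedef uint16_t ggml_half; // stand-in for the fp16 storage type

// Plain q8_1: 2 + 2 + 32 = 36 bytes -> 16-byte vector loads of qs are unaligned.
struct block_q8_1 {
    ggml_half d;      // scale
    ggml_half s;      // d * sum(qs), used by the *_1 dot products
    int8_t    qs[32]; // quants
};
static_assert(sizeof(block_q8_1) == 36, "36-byte blocks cannot be 16-byte aligned");

// Packed x4 variant: the four (d, s) pairs are hoisted to the front so the
// 4 * 32 = 128 bytes of quants start on a 16-byte boundary, and every qs
// access can be a full 128-bit load (provided the buffer itself is 16-byte aligned).
struct block_q8_1_x4 {
    ggml_half ds[4][2];  // (d, s) for each of the four sub-blocks: 16 bytes
    int8_t    qs[4][32]; // quants of the four sub-blocks: 128 bytes
};
static_assert(sizeof(block_q8_1_x4) == 144, "4 * 36 bytes, now a multiple of 16");
static_assert(offsetof(block_q8_1_x4, qs) % 16 == 0, "quant payload is 16-byte aligned");
```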

@0cc4m
Collaborator Author

0cc4m commented Aug 2, 2025

Here are some results from the current version:

Nvidia RTX 3090

Master:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --: | --: | --: | ---: | -----: | -------: | -----: | -------: | ---: | ----: |
| 4096 | 512 | 1 | 4608 | 3.900 | 1050.14 | 6.207 | 82.49 | 10.107 | 455.90 |
| 4096 | 512 | 2 | 9216 | 6.033 | 1357.88 | 30.604 | 33.46 | 36.636 | 251.55 |
| 4096 | 512 | 4 | 18432 | 16.503 | 992.79 | 58.499 | 35.01 | 75.002 | 245.75 |

PR:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --: | --: | --: | ---: | -----: | -------: | -----: | -------: | ---: | ----: |
| 4096 | 512 | 1 | 4608 | 3.912 | 1047.06 | 6.444 | 79.45 | 10.356 | 444.97 |
| 4096 | 512 | 2 | 9216 | 6.079 | 1347.60 | 30.561 | 33.51 | 36.640 | 251.53 |
| 4096 | 512 | 4 | 18432 | 16.582 | 988.07 | 57.161 | 35.83 | 73.743 | 249.95 |

On Nvidia, batched-bench seems to have an issue where shader compiles slow down some of the runs.

AMD Radeon RX 6800 XT

Master:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --: | --: | --: | ---: | -----: | -------: | -----: | -------: | ---: | ----: |
| 4096 | 512 | 1 | 4608 | 3.565 | 1149.01 | 8.900 | 57.53 | 12.465 | 369.68 |
| 4096 | 512 | 2 | 9216 | 8.519 | 961.64 | 32.738 | 31.28 | 41.256 | 223.38 |
| 4096 | 512 | 4 | 18432 | 22.255 | 736.19 | 61.596 | 33.25 | 83.851 | 219.82 |

PR:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --: | --: | --: | ---: | -----: | -------: | -----: | -------: | ---: | ----: |
| 4096 | 512 | 1 | 4608 | 3.225 | 1269.99 | 9.385 | 54.56 | 12.610 | 365.42 |
| 4096 | 512 | 2 | 9216 | 7.859 | 1042.32 | 32.840 | 31.18 | 40.700 | 226.44 |
| 4096 | 512 | 4 | 18432 | 20.895 | 784.09 | 59.938 | 34.17 | 80.833 | 228.03 |

AMD Radeon Pro VII

Master:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --: | --: | --: | ---: | -----: | -------: | -----: | -------: | ---: | ----: |
| 512 | 512 | 1 | 1024 | 0.811 | 631.55 | 8.859 | 57.80 | 9.669 | 105.90 |
| 512 | 512 | 2 | 2048 | 1.539 | 665.54 | 19.551 | 52.38 | 21.090 | 97.11 |
| 512 | 512 | 4 | 4096 | 3.241 | 631.98 | 33.277 | 61.54 | 36.517 | 112.17 |

PR:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --: | --: | --: | ---: | -----: | -------: | -----: | -------: | ---: | ----: |
| 512 | 512 | 1 | 1024 | 0.805 | 635.87 | 11.381 | 44.99 | 12.186 | 84.03 |
| 512 | 512 | 2 | 2048 | 1.485 | 689.54 | 22.796 | 44.92 | 24.281 | 84.35 |
| 512 | 512 | 4 | 4096 | 3.126 | 655.05 | 33.547 | 61.05 | 36.673 | 111.69 |

Intel A770

Master:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --: | --: | --: | ---: | -----: | -------: | -----: | -------: | ---: | ----: |
| 512 | 512 | 1 | 1024 | 0.702 | 729.48 | 16.858 | 30.37 | 17.560 | 58.31 |
| 512 | 512 | 2 | 2048 | 1.495 | 685.10 | 30.323 | 33.77 | 31.818 | 64.37 |
| 512 | 512 | 4 | 4096 | 3.360 | 609.61 | 48.322 | 42.38 | 51.681 | 79.26 |

PR:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --: | --: | --: | ---: | -----: | -------: | -----: | -------: | ---: | ----: |
| 512 | 512 | 1 | 1024 | 0.607 | 843.46 | 20.431 | 25.06 | 21.038 | 48.67 |
| 512 | 512 | 2 | 2048 | 1.306 | 783.96 | 35.346 | 28.97 | 36.652 | 55.88 |
| 512 | 512 | 4 | 4096 | 2.971 | 689.24 | 53.052 | 38.60 | 56.024 | 73.11 |

@0cc4m
Collaborator Author

0cc4m commented Aug 3, 2025

Here are some new results; performance is looking better now, even for small models. Deciding when to enable this path and when not to is still tricky, though.

@jeffbolznv Can you retest on your worst-case 5090? On my 3090 it looks like enabling this path on Nvidia may be worth it for Q4_1 and Q5_1, since those formats perform best due to their 16B alignment. If you see further optimization opportunities, let me know.
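
For context on the alignment argument, here is a small sketch that is just arithmetic on the ggml legacy block layouts (not code from the PR): the byte size of each block type determines the widest load that stays naturally aligned when blocks are stored back to back, and q4_1/q5_1 happen to have the more load-friendly sizes. The q8_1_x4 repacking mentioned earlier is what gets the activation side onto 16-byte boundaries.

```cpp
// Block sizes of the legacy quant formats (bytes) and the largest power-of-two
// load width that stays naturally aligned when blocks are stored contiguously.
#include <cstdio>

int main() {
    struct { const char * name; int bytes; } blocks[] = {
        {"q4_0", 2 + 16},          // fp16 d + 16 quant bytes                 = 18
        {"q4_1", 2 + 2 + 16},      // fp16 d, m + 16 quant bytes              = 20
        {"q5_0", 2 + 4 + 16},      // fp16 d + 4 high-bit bytes + 16 quants   = 22
        {"q5_1", 2 + 2 + 4 + 16},  // fp16 d, m + high-bit bytes + 16 quants  = 24
        {"q8_0", 2 + 32},          // fp16 d + 32 quant bytes                 = 34
    };
    for (auto & b : blocks) {
        int align = 1;
        // largest power of two (capped at 16) dividing the block size
        while (b.bytes % (align * 2) == 0 && align < 16) align *= 2;
        printf("%s: %2d bytes -> %2d-byte aligned loads\n", b.name, b.bytes, align);
    }
    return 0;
}
```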

Nvidia RTX 3090 (without coopmat1/2)

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | test | t/s (Master) | t/s (PR) |
| ------------- | ---------: | -----: | ------- | --: | ----: | ---------------: | ---------------: |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | Vulkan | 99 | pp512 | 9194.72 ± 323.10 | 8926.13 ± 203.39 |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | Vulkan | 99 | tg128 | 324.21 ± 56.21 | 311.07 ± 51.04 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | Vulkan | 99 | pp512 | 9189.23 ± 148.56 | 9296.94 ± 194.27 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | Vulkan | 99 | tg128 | 336.64 ± 10.50 | 327.20 ± 0.56 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | Vulkan | 99 | pp512 | 8678.07 ± 32.60 | 9060.36 ± 21.48 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | Vulkan | 99 | tg128 | 304.93 ± 5.38 | 310.19 ± 4.96 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | Vulkan | 99 | pp512 | 8807.90 ± 204.32 | 9108.72 ± 30.17 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | Vulkan | 99 | tg128 | 303.30 ± 3.87 | 292.32 ± 0.86 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Vulkan | 99 | pp512 | 9058.35 ± 32.73 | 9101.32 ± 23.69 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Vulkan | 99 | tg128 | 288.87 ± 2.46 | 267.09 ± 2.26 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | pp512 | 1912.84 ± 15.65 | 1924.58 ± 9.18 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | tg128 | 107.09 ± 0.18 | 107.85 ± 0.75 |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | pp512 | 1856.80 ± 10.88 | 1898.31 ± 9.23 |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | tg128 | 101.17 ± 0.30 | 108.18 ± 0.15 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | pp512 | 1884.42 ± 11.34 | 1898.32 ± 8.53 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | tg128 | 75.10 ± 0.15 | 74.57 ± 0.11 |

Master:
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20448 runs -    49.75 us/run - 117.44 MFLOP/run -   2.36 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    54.75 us/run - 117.44 MFLOP/run -   2.15 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    71.58 us/run - 117.44 MFLOP/run -   1.64 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    72.32 us/run - 117.44 MFLOP/run -   1.62 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    82.89 us/run - 117.44 MFLOP/run -   1.42 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16188 runs -    62.95 us/run - 234.88 MFLOP/run -   3.73 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    78.99 us/run - 234.88 MFLOP/run -   2.97 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12354 runs -    83.76 us/run - 234.88 MFLOP/run -   2.80 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10650 runs -    96.60 us/run - 234.88 MFLOP/run -   2.43 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -    98.03 us/run - 234.88 MFLOP/run -   2.40 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    78.66 us/run - 352.32 MFLOP/run -   4.48 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -    98.22 us/run - 352.32 MFLOP/run -   3.59 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10508 runs -    96.41 us/run - 352.32 MFLOP/run -   3.65 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8804 runs -   114.84 us/run - 352.32 MFLOP/run -   3.07 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7952 runs -   128.65 us/run - 352.32 MFLOP/run -   2.74 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10650 runs -    94.40 us/run - 469.76 MFLOP/run -   4.98 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   117.56 us/run - 469.76 MFLOP/run -   4.00 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8946 runs -   113.66 us/run - 469.76 MFLOP/run -   4.13 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7242 runs -   138.49 us/run - 469.76 MFLOP/run -   3.39 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6390 runs -   158.86 us/run - 469.76 MFLOP/run -   2.96 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8892 runs -   113.12 us/run - 587.20 MFLOP/run -   5.19 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7182 runs -   141.11 us/run - 587.20 MFLOP/run -   4.16 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7866 runs -   129.28 us/run - 587.20 MFLOP/run -   4.54 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6327 runs -   158.58 us/run - 587.20 MFLOP/run -   3.70 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5301 runs -   189.32 us/run - 587.20 MFLOP/run -   3.10 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5992 runs -   166.94 us/run - 939.52 MFLOP/run -   5.63 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4708 runs -   216.16 us/run - 939.52 MFLOP/run -   4.35 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5564 runs -   181.81 us/run - 939.52 MFLOP/run -   5.17 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4387 runs -   232.83 us/run - 939.52 MFLOP/run -   4.04 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3638 runs -   277.63 us/run - 939.52 MFLOP/run -   3.38 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      452 runs -  2216.07 us/run -  60.13 GFLOP/run -  27.13 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      436 runs -  2294.29 us/run -  60.13 GFLOP/run -  26.21 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      430 runs -  2333.07 us/run -  60.13 GFLOP/run -  25.77 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      426 runs -  2354.50 us/run -  60.13 GFLOP/run -  25.54 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      450 runs -  2224.99 us/run -  60.13 GFLOP/run -  27.02 TFLOPS

PR:
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              23004 runs -    44.03 us/run - 117.44 MFLOP/run -   2.67 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21300 runs -    47.81 us/run - 117.44 MFLOP/run -   2.46 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    53.90 us/run - 117.44 MFLOP/run -   2.18 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.25 us/run - 117.44 MFLOP/run -   2.09 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    79.10 us/run - 117.44 MFLOP/run -   1.48 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21300 runs -    47.81 us/run - 234.88 MFLOP/run -   4.91 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20022 runs -    50.45 us/run - 234.88 MFLOP/run -   4.66 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.73 us/run - 234.88 MFLOP/run -   4.14 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17040 runs -    59.36 us/run - 234.88 MFLOP/run -   3.96 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12354 runs -    81.78 us/run - 234.88 MFLOP/run -   2.87 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19312 runs -    52.29 us/run - 352.32 MFLOP/run -   6.74 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.05 us/run - 352.32 MFLOP/run -   6.29 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16188 runs -    62.64 us/run - 352.32 MFLOP/run -   5.62 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15620 runs -    64.26 us/run - 352.32 MFLOP/run -   5.48 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11928 runs -    85.25 us/run - 352.32 MFLOP/run -   4.13 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16827 runs -    59.72 us/run - 469.76 MFLOP/run -   7.87 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16401 runs -    61.13 us/run - 469.76 MFLOP/run -   7.68 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14271 runs -    70.17 us/run - 469.76 MFLOP/run -   6.69 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    69.48 us/run - 469.76 MFLOP/run -   6.76 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11076 runs -    91.45 us/run - 469.76 MFLOP/run -   5.14 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14877 runs -    67.93 us/run - 587.20 MFLOP/run -   8.64 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14193 runs -    71.19 us/run - 587.20 MFLOP/run -   8.25 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12825 runs -    78.98 us/run - 587.20 MFLOP/run -   7.43 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12996 runs -    77.27 us/run - 587.20 MFLOP/run -   7.60 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9747 runs -   104.15 us/run - 587.20 MFLOP/run -   5.64 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10486 runs -    95.93 us/run - 939.52 MFLOP/run -   9.79 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10058 runs -    99.53 us/run - 939.52 MFLOP/run -   9.44 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9309 runs -   107.87 us/run - 939.52 MFLOP/run -   8.71 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9416 runs -   106.53 us/run - 939.52 MFLOP/run -   8.82 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6741 runs -   150.02 us/run - 939.52 MFLOP/run -   6.26 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      456 runs -  2200.07 us/run -  60.13 GFLOP/run -  27.33 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      450 runs -  2224.65 us/run -  60.13 GFLOP/run -  27.03 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      428 runs -  2342.33 us/run -  60.13 GFLOP/run -  25.67 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      438 runs -  2284.31 us/run -  60.13 GFLOP/run -  26.32 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      456 runs -  2196.48 us/run -  60.13 GFLOP/run -  27.38 TFLOPS

AMD Radeon RX 6800 XT

ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | test | t/s (Master) | t/s (PR) |
| ------------- | ---------: | -----: | ------- | --: | ----: | ---------------: | ---------------: |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | Vulkan | 99 | pp512 | 8290.44 ± 68.15 | 9310.68 ± 184.78 |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | Vulkan | 99 | tg128 | 361.90 ± 0.40 | 346.92 ± 2.04 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | Vulkan | 99 | pp512 | 7996.84 ± 80.20 | 9143.00 ± 181.95 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | Vulkan | 99 | tg128 | 332.41 ± 0.99 | 333.82 ± 0.45 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | Vulkan | 99 | pp512 | 7788.58 ± 40.65 | 8975.54 ± 158.83 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | Vulkan | 99 | tg128 | 313.75 ± 0.63 | 317.04 ± 0.36 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | Vulkan | 99 | pp512 | 7645.57 ± 60.32 | 8879.91 ± 246.04 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | Vulkan | 99 | tg128 | 275.61 ± 0.57 | 299.83 ± 4.70 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Vulkan | 99 | pp512 | 7207.73 ± 13.38 | 8179.54 ± 114.85 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Vulkan | 99 | tg128 | 265.07 ± 0.14 | 249.98 ± 0.02 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | pp512 | 1460.24 ± 0.64 | 1651.88 ± 2.28 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | tg128 | 85.71 ± 0.05 | 86.02 ± 0.01 |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | pp512 | 1421.29 ± 2.25 | 1602.68 ± 2.05 |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | tg128 | 77.33 ± 0.15 | 79.11 ± 0.02 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | pp512 | 1247.77 ± 0.86 | 1391.76 ± 1.91 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | tg128 | 54.01 ± 0.06 | 53.82 ± 0.01 |

Master:
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              23004 runs -    45.06 us/run - 117.44 MFLOP/run -   2.61 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    53.43 us/run - 117.44 MFLOP/run -   2.20 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15336 runs -    67.74 us/run - 117.44 MFLOP/run -   1.73 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    71.62 us/run - 117.44 MFLOP/run -   1.64 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.38 us/run - 117.44 MFLOP/run -   2.08 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17040 runs -    59.68 us/run - 234.88 MFLOP/run -   3.94 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13206 runs -    75.97 us/run - 234.88 MFLOP/run -   3.09 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12354 runs -    83.24 us/run - 234.88 MFLOP/run -   2.82 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10650 runs -    94.06 us/run - 234.88 MFLOP/run -   2.50 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11928 runs -    86.19 us/run - 234.88 MFLOP/run -   2.73 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13632 runs -    73.55 us/run - 352.32 MFLOP/run -   4.79 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -    99.13 us/run - 352.32 MFLOP/run -   3.55 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10508 runs -    97.12 us/run - 352.32 MFLOP/run -   3.63 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8804 runs -   116.30 us/run - 352.32 MFLOP/run -   3.03 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   119.58 us/run - 352.32 MFLOP/run -   2.95 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -    99.28 us/run - 469.76 MFLOP/run -   4.73 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8307 runs -   120.53 us/run - 469.76 MFLOP/run -   3.90 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9159 runs -   110.02 us/run - 469.76 MFLOP/run -   4.27 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7242 runs -   138.17 us/run - 469.76 MFLOP/run -   3.40 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5964 runs -   168.71 us/run - 469.76 MFLOP/run -   2.78 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7866 runs -   129.20 us/run - 587.20 MFLOP/run -   4.54 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7011 runs -   143.02 us/run - 587.20 MFLOP/run -   4.11 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6669 runs -   150.73 us/run - 587.20 MFLOP/run -   3.90 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6327 runs -   160.34 us/run - 587.20 MFLOP/run -   3.66 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4788 runs -   211.90 us/run - 587.20 MFLOP/run -   2.77 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5029 runs -   200.66 us/run - 939.52 MFLOP/run -   4.68 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4815 runs -   208.91 us/run - 939.52 MFLOP/run -   4.50 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4173 runs -   240.01 us/run - 939.52 MFLOP/run -   3.91 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4387 runs -   230.94 us/run - 939.52 MFLOP/run -   4.07 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2782 runs -   373.07 us/run - 939.52 MFLOP/run -   2.52 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      354 runs -  2832.15 us/run -  60.13 GFLOP/run -  21.23 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      346 runs -  2893.23 us/run -  60.13 GFLOP/run -  20.78 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      338 runs -  2970.86 us/run -  60.13 GFLOP/run -  20.24 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      332 runs -  3017.38 us/run -  60.13 GFLOP/run -  19.93 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      304 runs -  3298.77 us/run -  60.13 GFLOP/run -  18.23 TFLOPS

PR:
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              33228 runs -    30.60 us/run - 117.44 MFLOP/run -   3.84 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              34080 runs -    30.07 us/run - 117.44 MFLOP/run -   3.91 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              23004 runs -    44.67 us/run - 117.44 MFLOP/run -   2.63 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              25560 runs -    40.09 us/run - 117.44 MFLOP/run -   2.93 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              26412 runs -    38.47 us/run - 117.44 MFLOP/run -   3.05 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              23856 runs -    41.98 us/run - 234.88 MFLOP/run -   5.60 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              23856 runs -    42.50 us/run - 234.88 MFLOP/run -   5.53 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17466 runs -    58.52 us/run - 234.88 MFLOP/run -   4.01 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19596 runs -    52.02 us/run - 234.88 MFLOP/run -   4.52 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    53.66 us/run - 234.88 MFLOP/run -   4.38 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19028 runs -    53.15 us/run - 352.32 MFLOP/run -   6.63 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.06 us/run - 352.32 MFLOP/run -   6.28 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15052 runs -    67.57 us/run - 352.32 MFLOP/run -   5.21 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15904 runs -    63.25 us/run - 352.32 MFLOP/run -   5.57 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    70.08 us/run - 352.32 MFLOP/run -   5.03 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15336 runs -    65.73 us/run - 469.76 MFLOP/run -   7.15 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15549 runs -    64.68 us/run - 469.76 MFLOP/run -   7.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    78.39 us/run - 469.76 MFLOP/run -   5.99 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13206 runs -    76.58 us/run - 469.76 MFLOP/run -   6.13 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11928 runs -    84.79 us/run - 469.76 MFLOP/run -   5.54 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12996 runs -    76.95 us/run - 587.20 MFLOP/run -   7.63 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12996 runs -    77.88 us/run - 587.20 MFLOP/run -   7.54 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11115 runs -    90.56 us/run - 587.20 MFLOP/run -   6.48 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11457 runs -    87.54 us/run - 587.20 MFLOP/run -   6.71 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9918 runs -   101.76 us/run - 587.20 MFLOP/run -   5.77 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8881 runs -   112.68 us/run - 939.52 MFLOP/run -   8.34 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8774 runs -   114.75 us/run - 939.52 MFLOP/run -   8.19 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7918 runs -   127.01 us/run - 939.52 MFLOP/run -   7.40 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7918 runs -   126.95 us/run - 939.52 MFLOP/run -   7.40 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6741 runs -   149.44 us/run - 939.52 MFLOP/run -   6.29 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      424 runs -  2361.22 us/run -  60.13 GFLOP/run -  25.47 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      412 runs -  2433.19 us/run -  60.13 GFLOP/run -  24.71 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      406 runs -  2469.37 us/run -  60.13 GFLOP/run -  24.35 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      370 runs -  2706.55 us/run -  60.13 GFLOP/run -  22.22 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      346 runs -  2890.54 us/run -  60.13 GFLOP/run -  20.80 TFLOPS

AMD Radeon Pro VII

ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | test | t/s (Master) | t/s (PR) |
| ------------- | ---------: | -----: | ------- | --: | ----: | ---------------: | ---------------: |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | Vulkan | 99 | pp512 | 4319.68 ± 10.73 | 4166.32 ± 27.82 |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | Vulkan | 99 | tg128 | 237.71 ± 7.57 | 225.20 ± 10.54 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | Vulkan | 99 | pp512 | 3846.38 ± 24.25 | 3821.60 ± 8.56 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | Vulkan | 99 | tg128 | 188.97 ± 0.75 | 246.49 ± 2.26 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | Vulkan | 99 | pp512 | 4089.88 ± 14.79 | 3985.56 ± 15.74 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | Vulkan | 99 | tg128 | 185.73 ± 1.35 | 199.39 ± 7.51 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | Vulkan | 99 | pp512 | 3694.51 ± 13.06 | 3686.57 ± 14.67 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | Vulkan | 99 | tg128 | 175.89 ± 0.34 | 225.69 ± 2.53 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Vulkan | 99 | pp512 | 3858.94 ± 10.64 | 3830.51 ± 13.83 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Vulkan | 99 | tg128 | 190.62 ± 1.98 | 198.15 ± 2.20 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | pp512 | 681.63 ± 0.85 | 708.55 ± 0.63 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | tg128 | 64.35 ± 0.59 | 72.14 ± 0.92 |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | pp512 | 617.22 ± 0.26 | 651.25 ± 0.74 |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | tg128 | 55.57 ± 0.09 | 83.98 ± 1.04 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | pp512 | 614.20 ± 0.14 | 633.82 ± 0.79 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | tg128 | 45.15 ± 0.66 | 50.87 ± 0.19 |

Master:
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11076 runs -    94.62 us/run - 117.44 MFLOP/run -   1.24 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9372 runs -   114.40 us/run - 117.44 MFLOP/run -   1.03 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7668 runs -   136.06 us/run - 117.44 MFLOP/run - 863.17 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   149.48 us/run - 117.44 MFLOP/run - 785.65 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   149.81 us/run - 117.44 MFLOP/run - 783.92 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   121.36 us/run - 234.88 MFLOP/run -   1.94 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   180.98 us/run - 234.88 MFLOP/run -   1.30 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6390 runs -   165.74 us/run - 234.88 MFLOP/run -   1.42 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   205.79 us/run - 234.88 MFLOP/run -   1.14 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   196.30 us/run - 234.88 MFLOP/run -   1.20 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   150.00 us/run - 352.32 MFLOP/run -   2.35 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4544 runs -   229.42 us/run - 352.32 MFLOP/run -   1.54 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5396 runs -   189.78 us/run - 352.32 MFLOP/run -   1.86 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3976 runs -   259.73 us/run - 352.32 MFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3976 runs -   258.44 us/run - 352.32 MFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   185.89 us/run - 469.76 MFLOP/run -   2.53 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3621 runs -   278.23 us/run - 469.76 MFLOP/run -   1.69 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4686 runs -   218.55 us/run - 469.76 MFLOP/run -   2.15 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3408 runs -   308.20 us/run - 469.76 MFLOP/run -   1.52 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2769 runs -   383.07 us/run - 469.76 MFLOP/run -   1.23 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4446 runs -   225.21 us/run - 587.20 MFLOP/run -   2.61 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3078 runs -   330.66 us/run - 587.20 MFLOP/run -   1.78 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4104 runs -   250.37 us/run - 587.20 MFLOP/run -   2.35 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2736 runs -   366.23 us/run - 587.20 MFLOP/run -   1.60 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2223 runs -   453.64 us/run - 587.20 MFLOP/run -   1.29 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2996 runs -   337.33 us/run - 939.52 MFLOP/run -   2.79 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1498 runs -   682.17 us/run - 939.52 MFLOP/run -   1.38 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2996 runs -   336.23 us/run - 939.52 MFLOP/run -   2.79 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1391 runs -   724.83 us/run - 939.52 MFLOP/run -   1.30 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1498 runs -   680.78 us/run - 939.52 MFLOP/run -   1.38 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      136 runs -  7389.38 us/run -  60.13 GFLOP/run -   8.14 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      128 runs -  7842.16 us/run -  60.13 GFLOP/run -   7.67 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      132 runs -  7683.52 us/run -  60.13 GFLOP/run -   7.83 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      126 runs -  7996.14 us/run -  60.13 GFLOP/run -   7.52 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      124 runs -  8115.40 us/run -  60.13 GFLOP/run -   7.41 TFLOPS

PR:
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13632 runs -    73.72 us/run - 117.44 MFLOP/run -   1.59 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17040 runs -    58.84 us/run - 117.44 MFLOP/run -   2.00 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9372 runs -   111.69 us/run - 117.44 MFLOP/run -   1.05 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    80.30 us/run - 117.44 MFLOP/run -   1.46 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   128.43 us/run - 117.44 MFLOP/run - 914.40 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11928 runs -    84.72 us/run - 234.88 MFLOP/run -   2.77 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    70.82 us/run - 234.88 MFLOP/run -   3.32 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8094 runs -   124.44 us/run - 234.88 MFLOP/run -   1.89 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10650 runs -    95.75 us/run - 234.88 MFLOP/run -   2.45 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   148.05 us/run - 234.88 MFLOP/run -   1.59 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10508 runs -    97.60 us/run - 352.32 MFLOP/run -   3.61 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11360 runs -    88.90 us/run - 352.32 MFLOP/run -   3.96 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7384 runs -   137.00 us/run - 352.32 MFLOP/run -   2.57 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9088 runs -   112.16 us/run - 352.32 MFLOP/run -   3.14 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6248 runs -   167.61 us/run - 352.32 MFLOP/run -   2.10 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8946 runs -   113.21 us/run - 469.76 MFLOP/run -   4.15 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9798 runs -   104.24 us/run - 469.76 MFLOP/run -   4.51 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6603 runs -   153.85 us/run - 469.76 MFLOP/run -   3.05 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7881 runs -   130.09 us/run - 469.76 MFLOP/run -   3.61 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   186.63 us/run - 469.76 MFLOP/run -   2.52 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8037 runs -   126.66 us/run - 587.20 MFLOP/run -   4.64 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8379 runs -   120.66 us/run - 587.20 MFLOP/run -   4.87 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6156 runs -   166.61 us/run - 587.20 MFLOP/run -   3.52 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7011 runs -   143.56 us/run - 587.20 MFLOP/run -   4.09 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4959 runs -   205.72 us/run - 587.20 MFLOP/run -   2.85 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5778 runs -   174.42 us/run - 939.52 MFLOP/run -   5.39 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5778 runs -   175.33 us/run - 939.52 MFLOP/run -   5.36 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4815 runs -   208.77 us/run - 939.52 MFLOP/run -   4.50 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5136 runs -   197.36 us/run - 939.52 MFLOP/run -   4.76 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3745 runs -   269.83 us/run - 939.52 MFLOP/run -   3.48 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      130 runs -  7794.81 us/run -  60.13 GFLOP/run -   7.71 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      122 runs -  8280.09 us/run -  60.13 GFLOP/run -   7.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      126 runs -  8040.53 us/run -  60.13 GFLOP/run -   7.48 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      120 runs -  8455.30 us/run -  60.13 GFLOP/run -   7.11 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      118 runs -  8612.27 us/run -  60.13 GFLOP/run -   6.98 TFLOPS

Intel A770

ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | test | t/s (Master) | t/s (PR) |
| ------------- | ---------: | -----: | ------- | --: | ----: | ---------------: | ---------------: |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | Vulkan | 99 | pp512 | 3848.99 ± 230.91 | 4224.15 ± 272.97 |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | Vulkan | 99 | tg128 | 116.43 ± 1.40 | 120.02 ± 0.12 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | Vulkan | 99 | pp512 | 3844.84 ± 230.45 | 4211.68 ± 262.99 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | Vulkan | 99 | tg128 | 115.23 ± 2.04 | 132.14 ± 0.05 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | Vulkan | 99 | pp512 | 3700.45 ± 205.58 | 4016.62 ± 236.41 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | Vulkan | 99 | tg128 | 58.72 ± 0.07 | 78.30 ± 0.09 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | Vulkan | 99 | pp512 | 3730.14 ± 212.64 | 4073.48 ± 250.03 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | Vulkan | 99 | tg128 | 102.79 ± 0.14 | 117.86 ± 0.04 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Vulkan | 99 | pp512 | 3551.92 ± 194.54 | 3872.60 ± 232.64 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Vulkan | 99 | tg128 | 53.32 ± 0.11 | 120.84 ± 0.04 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | pp512 | 739.69 ± 0.93 | 843.71 ± 0.86 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | tg128 | 32.68 ± 0.03 | 33.73 ± 0.04 |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | pp512 | 740.14 ± 1.95 | 839.24 ± 1.23 |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | tg128 | 32.51 ± 0.05 | 41.05 ± 0.01 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | pp512 | 657.79 ± 1.01 | 737.46 ± 1.02 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | tg128 | 9.85 ± 0.00 | 32.37 ± 0.05 |

Master:
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   149.17 us/run - 117.44 MFLOP/run - 787.29 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   153.80 us/run - 117.44 MFLOP/run - 763.62 GFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2556 runs -   554.93 us/run - 117.44 MFLOP/run - 211.63 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   198.64 us/run - 117.44 MFLOP/run - 591.24 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1704 runs -   819.78 us/run - 117.44 MFLOP/run - 143.26 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   155.23 us/run - 234.88 MFLOP/run -   1.51 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   186.86 us/run - 234.88 MFLOP/run -   1.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2130 runs -   482.87 us/run - 234.88 MFLOP/run - 486.43 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   203.09 us/run - 234.88 MFLOP/run -   1.16 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1278 runs -   963.53 us/run - 234.88 MFLOP/run - 243.77 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6248 runs -   165.31 us/run - 352.32 MFLOP/run -   2.13 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4828 runs -   207.60 us/run - 352.32 MFLOP/run -   1.70 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1988 runs -   515.32 us/run - 352.32 MFLOP/run - 683.69 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4828 runs -   218.91 us/run - 352.32 MFLOP/run -   1.61 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1704 runs -   652.91 us/run - 352.32 MFLOP/run - 539.61 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   186.39 us/run - 469.76 MFLOP/run -   2.52 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4047 runs -   253.83 us/run - 469.76 MFLOP/run -   1.85 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2130 runs -   507.43 us/run - 469.76 MFLOP/run - 925.76 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4047 runs -   249.59 us/run - 469.76 MFLOP/run -   1.88 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1491 runs -   673.97 us/run - 469.76 MFLOP/run - 697.00 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3420 runs -   305.60 us/run - 587.20 MFLOP/run -   1.92 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2394 runs -   446.38 us/run - 587.20 MFLOP/run -   1.32 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1710 runs -   623.58 us/run - 587.20 MFLOP/run - 941.67 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2223 runs -   477.75 us/run - 587.20 MFLOP/run -   1.23 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1197 runs -   924.28 us/run - 587.20 MFLOP/run - 635.31 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3103 runs -   326.34 us/run - 939.52 MFLOP/run -   2.88 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2140 runs -   473.78 us/run - 939.52 MFLOP/run -   1.98 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1712 runs -   585.79 us/run - 939.52 MFLOP/run -   1.60 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2247 runs -   453.44 us/run - 939.52 MFLOP/run -   2.07 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                963 runs -  1087.41 us/run - 939.52 MFLOP/run - 864.00 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      172 runs -  5838.30 us/run -  60.13 GFLOP/run -  10.30 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      182 runs -  5495.88 us/run -  60.13 GFLOP/run -  10.94 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      176 runs -  5734.22 us/run -  60.13 GFLOP/run -  10.49 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      162 runs -  6229.44 us/run -  60.13 GFLOP/run -   9.65 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      160 runs -  6299.02 us/run -  60.13 GFLOP/run -   9.55 TFLOPS

PR:
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5964 runs -   168.10 us/run - 117.44 MFLOP/run - 698.62 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9372 runs -   114.36 us/run - 117.44 MFLOP/run -   1.03 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2556 runs -   404.84 us/run - 117.44 MFLOP/run - 290.09 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   153.53 us/run - 117.44 MFLOP/run - 764.95 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5964 runs -   190.12 us/run - 117.44 MFLOP/run - 617.70 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5964 runs -   172.44 us/run - 234.88 MFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   146.98 us/run - 234.88 MFLOP/run -   1.60 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2982 runs -   339.21 us/run - 234.88 MFLOP/run - 692.43 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   206.71 us/run - 234.88 MFLOP/run -   1.14 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   211.92 us/run - 234.88 MFLOP/run -   1.11 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3692 runs -   277.84 us/run - 352.32 MFLOP/run -   1.27 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6532 runs -   155.90 us/run - 352.32 MFLOP/run -   2.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2272 runs -   448.46 us/run - 352.32 MFLOP/run - 785.62 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2840 runs -   361.84 us/run - 352.32 MFLOP/run - 973.69 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3976 runs -   264.50 us/run - 352.32 MFLOP/run -   1.33 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3195 runs -   332.54 us/run - 469.76 MFLOP/run -   1.41 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3195 runs -   324.34 us/run - 469.76 MFLOP/run -   1.45 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2130 runs -   484.14 us/run - 469.76 MFLOP/run - 970.31 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2769 runs -   378.15 us/run - 469.76 MFLOP/run -   1.24 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3834 runs -   267.34 us/run - 469.76 MFLOP/run -   1.76 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2736 runs -   366.77 us/run - 587.20 MFLOP/run -   1.60 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2565 runs -   392.54 us/run - 587.20 MFLOP/run -   1.50 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1881 runs -   558.01 us/run - 587.20 MFLOP/run -   1.05 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2052 runs -   511.96 us/run - 587.20 MFLOP/run -   1.15 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2907 runs -   350.20 us/run - 587.20 MFLOP/run -   1.68 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1605 runs -   639.74 us/run - 939.52 MFLOP/run -   1.47 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1498 runs -   706.44 us/run - 939.52 MFLOP/run -   1.33 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1284 runs -   817.76 us/run - 939.52 MFLOP/run -   1.15 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1391 runs -   750.15 us/run - 939.52 MFLOP/run -   1.25 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1712 runs -   590.37 us/run - 939.52 MFLOP/run -   1.59 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      222 runs -  4508.30 us/run -  60.13 GFLOP/run -  13.34 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      216 runs -  4641.71 us/run -  60.13 GFLOP/run -  12.95 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      202 runs -  4961.32 us/run -  60.13 GFLOP/run -  12.12 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      208 runs -  4838.35 us/run -  60.13 GFLOP/run -  12.43 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      186 runs -  5414.21 us/run -  60.13 GFLOP/run -  11.11 TFLOPS

@jeffbolznv
Collaborator

Some quick results:

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128 -r 10 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\Meta-Llama-3-8B.Q4_0.gguf -m c:\models\Meta-Llama-3-8B.Q4_1.gguf -m c:\models\Meta-Llama-3-8B.Q5_0.gguf -m c:\models\Meta-Llama-3-8B.Q5_1.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        237.90 ± 0.57 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        218.49 ± 3.46 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        188.38 ± 6.95 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        171.80 ± 3.68 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        161.96 ± 2.28 |

build: 6c7a4411 (6076)

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128 -r 10 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\Meta-Llama-3-8B.Q4_0.gguf -m c:\models\Meta-Llama-3-8B.Q4_1.gguf -m c:\models\Meta-Llama-3-8B.Q5_0.gguf -m c:\models\Meta-Llama-3-8B.Q5_1.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        224.81 ± 0.64 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        205.56 ± 7.02 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        191.34 ± 5.13 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        176.42 ± 4.35 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        172.84 ± 5.79 |

build: 32585e7c (6072)

before:

  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                16368 runs -    63.20 us/run - 134.48 MFLOP/run -   2.13 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                14136 runs -    70.81 us/run - 134.48 MFLOP/run -   1.90 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              62196 runs -    16.20 us/run - 117.44 MFLOP/run -   7.25 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              56232 runs -    17.80 us/run - 117.44 MFLOP/run -   6.60 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              37488 runs -    27.04 us/run - 117.44 MFLOP/run -   4.34 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              36636 runs -    27.44 us/run - 117.44 MFLOP/run -   4.28 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              49416 runs -    20.55 us/run - 117.44 MFLOP/run -   5.72 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              48564 runs -    20.60 us/run - 234.88 MFLOP/run -  11.40 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              44304 runs -    22.61 us/run - 234.88 MFLOP/run -  10.39 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              32376 runs -    31.23 us/run - 234.88 MFLOP/run -   7.52 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              32802 runs -    30.56 us/run - 234.88 MFLOP/run -   7.69 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              32376 runs -    31.24 us/run - 234.88 MFLOP/run -   7.52 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              37772 runs -    26.58 us/run - 352.32 MFLOP/run -  13.26 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              30956 runs -    32.55 us/run - 352.32 MFLOP/run -  10.82 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              25844 runs -    38.75 us/run - 352.32 MFLOP/run -   9.09 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              24140 runs -    41.79 us/run - 352.32 MFLOP/run -   8.43 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21868 runs -    46.05 us/run - 352.32 MFLOP/run -   7.65 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              29820 runs -    33.66 us/run - 469.76 MFLOP/run -  13.96 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              25347 runs -    39.74 us/run - 469.76 MFLOP/run -  11.82 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              22152 runs -    45.45 us/run - 469.76 MFLOP/run -  10.34 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20874 runs -    48.12 us/run - 469.76 MFLOP/run -   9.76 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18318 runs -    55.05 us/run - 469.76 MFLOP/run -   8.53 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              24111 runs -    41.58 us/run - 587.20 MFLOP/run -  14.12 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20007 runs -    50.41 us/run - 587.20 MFLOP/run -  11.65 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19152 runs -    52.62 us/run - 587.20 MFLOP/run -  11.16 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17613 runs -    57.09 us/run - 587.20 MFLOP/run -  10.28 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14706 runs -    68.69 us/run - 587.20 MFLOP/run -   8.55 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12305 runs -    81.67 us/run - 939.52 MFLOP/run -  11.50 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12305 runs -    81.49 us/run - 939.52 MFLOP/run -  11.53 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10914 runs -    92.02 us/run - 939.52 MFLOP/run -  10.21 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11021 runs -    91.32 us/run - 939.52 MFLOP/run -  10.29 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4708 runs -   216.60 us/run - 939.52 MFLOP/run -   4.34 TFLOPS
  
after:

  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                16368 runs -    63.43 us/run - 134.48 MFLOP/run -   2.12 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                14136 runs -    71.03 us/run - 134.48 MFLOP/run -   1.89 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              88608 runs -    11.36 us/run - 117.44 MFLOP/run -  10.33 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              88608 runs -    11.36 us/run - 117.44 MFLOP/run -  10.34 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              72420 runs -    13.82 us/run - 117.44 MFLOP/run -   8.50 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              77532 runs -    13.03 us/run - 117.44 MFLOP/run -   9.02 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              62196 runs -    16.25 us/run - 117.44 MFLOP/run -   7.23 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              73272 runs -    13.68 us/run - 234.88 MFLOP/run -  17.17 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              70716 runs -    14.22 us/run - 234.88 MFLOP/run -  16.51 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              56658 runs -    17.65 us/run - 234.88 MFLOP/run -  13.31 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              65178 runs -    15.37 us/run - 234.88 MFLOP/run -  15.28 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              49416 runs -    20.36 us/run - 234.88 MFLOP/run -  11.54 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              55096 runs -    18.23 us/run - 352.32 MFLOP/run -  19.33 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              56232 runs -    17.87 us/run - 352.32 MFLOP/run -  19.72 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              48848 runs -    20.53 us/run - 352.32 MFLOP/run -  17.16 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              51688 runs -    19.39 us/run - 352.32 MFLOP/run -  18.17 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              39760 runs -    25.21 us/run - 352.32 MFLOP/run -  13.97 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              47712 runs -    20.97 us/run - 469.76 MFLOP/run -  22.40 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              46221 runs -    21.67 us/run - 469.76 MFLOP/run -  21.68 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              42387 runs -    23.60 us/run - 469.76 MFLOP/run -  19.91 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              43239 runs -    23.23 us/run - 469.76 MFLOP/run -  20.22 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              31950 runs -    31.50 us/run - 469.76 MFLOP/run -  14.91 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              40356 runs -    24.85 us/run - 587.20 MFLOP/run -  23.63 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              39330 runs -    25.43 us/run - 587.20 MFLOP/run -  23.09 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              35739 runs -    28.06 us/run - 587.20 MFLOP/run -  20.92 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              37278 runs -    26.91 us/run - 587.20 MFLOP/run -  21.82 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              27702 runs -    36.21 us/run - 587.20 MFLOP/run -  16.22 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              27927 runs -    35.85 us/run - 939.52 MFLOP/run -  26.21 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              22791 runs -    44.02 us/run - 939.52 MFLOP/run -  21.34 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              25680 runs -    39.01 us/run - 939.52 MFLOP/run -  24.08 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21186 runs -    47.32 us/run - 939.52 MFLOP/run -  19.85 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19581 runs -    51.14 us/run - 939.52 MFLOP/run -  18.37 TFLOPS

@0cc4m
Collaborator Author

0cc4m commented Aug 3, 2025

Thank you, that shows I'm on the right path.

@0cc4m 0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec branch from 32585e7 to afc464a on August 17, 2025 14:01
@0cc4m
Collaborator Author

0cc4m commented Aug 17, 2025

@jeffbolznv I retested this and found that it now improves tg performance in most of my tests. I updated the Nvidia driver to 580.76.05, so I'm not sure whether that helped. I think this is ready to merge, but I'll wait for #15355 before resolving the subgroup reduce conflict.

Can you give this another try and let me know if you have any concerns?

Nvidia RTX 3090 (without coopmat)

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none

model size params backend ngl fa test t/s (before) t/s (after) diff
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 pp512 1979.04 ± 9.97 1983.01 ± 14.40 +0.2%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 tg128 139.60 ± 0.42 146.23 ± 0.31 +4.7%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 pp512 1973.27 ± 7.60 1986.97 ± 12.57 +0.7%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 tg128 141.57 ± 0.35 145.54 ± 0.31 +2.8%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 1916.91 ± 7.65 1930.39 ± 12.38 +0.7%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 125.68 ± 2.17 130.67 ± 0.16 +4.0%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 1854.64 ± 63.76 1875.40 ± 25.48 +1.1%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 122.26 ± 0.80 126.48 ± 1.52 +3.5%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 1859.86 ± 9.22 1907.00 ± 12.33 +2.5%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 117.76 ± 0.47 123.85 ± 0.82 +5.2%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 1785.04 ± 37.53 1836.46 ± 29.76 +2.9%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 115.01 ± 0.18 120.49 ± 0.35 +4.8%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 1884.02 ± 11.26 1880.50 ± 32.65 -0.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 84.67 ± 0.13 85.86 ± 0.46 +1.4%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 1837.75 ± 25.64 1808.84 ± 22.32 -1.6%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 83.07 ± 0.12 84.15 ± 0.08 +1.3%

AMD Radeon RX 6800 XT

ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

model size params backend ngl fa test t/s (before) t/s (after) diff
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 pp512 1498.28 ± 2.62 1701.59 ± 2.37 +13.6%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 tg128 99.44 ± 0.02 96.28 ± 0.03 -3.2%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 pp512 1475.39 ± 0.59 1677.09 ± 1.45 +13.7%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 tg128 98.39 ± 0.02 95.10 ± 0.02 -3.3%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 1451.42 ± 1.19 1640.18 ± 1.57 +13.0%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 88.29 ± 0.03 88.86 ± 0.02 +0.6%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 1395.78 ± 0.47 1576.13 ± 0.42 +12.9%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 83.35 ± 0.02 84.49 ± 0.01 +1.4%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 1413.33 ± 1.49 1592.03 ± 1.74 +12.6%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 79.65 ± 0.02 81.68 ± 0.01 +2.5%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 1360.71 ± 0.40 1530.01 ± 0.55 +12.4%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 75.42 ± 0.01 77.57 ± 0.03 +2.9%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 1240.33 ± 1.34 1382.96 ± 1.47 +11.5%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 55.32 ± 0.01 55.17 ± 0.01 -0.3%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 1203.89 ± 0.64 1340.01 ± 0.26 +11.3%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 52.84 ± 0.01 52.75 ± 0.01 -0.2%

AMD Radeon Pro VII

ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

model size params backend ngl fa test t/s (before) t/s (after) diff
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 pp512 806.60 ± 0.64 889.24 ± 4.58 +10.2%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 tg128 75.75 ± 0.55 96.14 ± 0.32 +26.9%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 pp512 719.33 ± 0.67 767.71 ± 0.88 +6.7%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 tg128 77.48 ± 0.26 98.92 ± 0.34 +27.7%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 718.33 ± 1.49 735.30 ± 0.58 +2.4%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 69.15 ± 0.35 87.23 ± 1.04 +26.1%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 650.45 ± 0.71 665.57 ± 0.23 +2.3%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 66.37 ± 0.34 79.83 ± 0.54 +20.3%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 615.49 ± 0.40 645.20 ± 0.72 +4.8%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 60.94 ± 0.07 85.82 ± 0.31 +40.8%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 565.78 ± 1.11 590.54 ± 0.68 +4.4%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 57.84 ± 0.53 80.80 ± 0.13 +39.7%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 630.28 ± 0.27 643.85 ± 0.23 +2.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 58.55 ± 0.49 63.85 ± 0.11 +9.1%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 576.81 ± 0.27 586.58 ± 4.07 +1.7%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 55.46 ± 0.20 60.79 ± 0.04 +9.6%

Intel A770

ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

model size params backend ngl fa test t/s (before) t/s (after) diff
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 pp512 642.91 ± 0.53 742.16 ± 0.63 +15.4%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 tg128 40.58 ± 0.10 42.26 ± 0.02 +4.1%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 pp512 230.80 ± 0.24 242.04 ± 0.26 +4.9%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 tg128 41.18 ± 0.07 42.89 ± 0.05 +4.2%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 739.25 ± 0.74 829.59 ± 0.38 +12.2%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 33.77 ± 0.03 34.99 ± 0.03 +3.6%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 239.08 ± 0.17 251.47 ± 0.11 +5.2%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 26.13 ± 0.03 26.40 ± 0.01 +1.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 736.43 ± 1.42 820.08 ± 3.56 +11.4%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 33.52 ± 0.01 41.40 ± 0.00 +23.5%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 240.77 ± 0.10 251.12 ± 0.10 +4.3%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 25.97 ± 0.02 29.82 ± 0.03 +14.8%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 655.59 ± 1.38 731.98 ± 0.77 +11.7%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 9.93 ± 0.01 33.60 ± 0.01 +238.4%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 231.13 ± 0.29 241.90 ± 0.07 +4.7%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 9.14 ± 0.01 25.75 ± 0.03 +181.7%

@jeffbolznv
Collaborator

Sure, I'd like to retest this after it's rebased past #15355, so I can see how it interacts with the different workgroup sizes. But this looks really promising.

@0cc4m 0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec branch from afc464a to 39d620a on August 17, 2025 19:13
@0cc4m
Collaborator Author

0cc4m commented Aug 17, 2025

I fixed a quantization bug and did the bare minimum to make this work side by side with #15355. Combining the optimizations is messier than I thought, especially now that there are three variants of the reduce function. I tried using yours, but that is measurably slower than my subgroup-only variant (probably due to no shared memory). I guess I might need three variants of my shader, and maybe that is also worth doing for your DMMV_WG_SIZE_SUBGROUP path. I'll take another look tomorrow.

@jeffbolznv
Collaborator

I'm still seeing slowdowns, particularly for Q8_0 and usually (but not always) for Q4_0:

5090 before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 5 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\glm-4-9b-chat-Q4_0.gguf -m C:\models\GLM-4-32B-0414-Q4_0.gguf -m C:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_1.gguf -m C:\models\Meta-Llama-3-8B.Q5_0.gguf -m C:\models\Meta-Llama-3-8B.Q5_1.gguf -m C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m C:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m C:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        221.60 ± 2.15 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        273.46 ± 0.55 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |        175.91 ± 4.49 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         62.26 ± 0.22 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       347.20 ± 22.44 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        201.33 ± 9.39 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        190.95 ± 6.49 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        172.70 ± 7.62 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        168.39 ± 3.16 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        144.10 ± 6.53 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |        150.18 ± 7.15 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |       239.63 ± 12.84 |

5090 after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 5 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\glm-4-9b-chat-Q4_0.gguf -m C:\models\GLM-4-32B-0414-Q4_0.gguf -m C:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_1.gguf -m C:\models\Meta-Llama-3-8B.Q5_0.gguf -m C:\models\Meta-Llama-3-8B.Q5_1.gguf -m C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m C:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m C:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        216.56 ± 1.40 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        249.90 ± 1.82 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |        178.59 ± 7.13 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         60.79 ± 0.04 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        319.19 ± 6.72 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        200.42 ± 4.91 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        193.13 ± 2.99 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        177.13 ± 6.04 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        174.55 ± 4.17 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        140.05 ± 4.92 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |        144.76 ± 6.34 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |        228.56 ± 1.37 |

4070 before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\glm-4-9b-chat-Q4_0.gguf -m C:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_1.gguf -m C:\models\Meta-Llama-3-8B.Q5_0.gguf -m C:\models\Meta-Llama-3-8B.Q5_1.gguf -m C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m C:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m C:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        102.36 ± 0.18 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        117.95 ± 0.11 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |         79.08 ± 1.66 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        183.58 ± 0.47 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         92.25 ± 1.71 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         84.83 ± 1.64 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         77.66 ± 1.51 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         72.70 ± 1.23 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         55.01 ± 0.04 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |         57.48 ± 0.52 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |        116.96 ± 0.40 |

4070 after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\glm-4-9b-chat-Q4_0.gguf -m C:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_1.gguf -m C:\models\Meta-Llama-3-8B.Q5_0.gguf -m C:\models\Meta-Llama-3-8B.Q5_1.gguf -m C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m C:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m C:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |         99.57 ± 0.22 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        114.50 ± 0.26 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |         78.76 ± 0.59 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        177.88 ± 0.39 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         90.56 ± 1.49 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         84.61 ± 0.20 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         77.60 ± 0.82 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         72.65 ± 0.52 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         54.14 ± 0.11 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |         56.69 ± 0.57 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |        112.93 ± 0.18 |

> I tried using yours, but that is measurably slower than my subgroup-only variant (probably due to no shared memory).

Can this just have a runtime check and avoid shared memory when there's only one subgroup?

@0cc4m
Collaborator Author

0cc4m commented Aug 18, 2025

> I tried using yours, but that is measurably slower than my subgroup-only variant (probably due to no shared memory).

> Can this just have a runtime check and avoid shared memory when there's only one subgroup?

Does the shared memory get optimized out at runtime if it is not used? Maybe just with a specialization constant? I always have some doubts, especially about AMD and Intel optimizers.

@jeffbolznv
Collaborator

I think the shared memory is only guaranteed to be optimized out if it's not statically used, so guarding it with a spec constant wouldn't be sufficient.
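
In practice that means compiling separate shader variants and picking one per dispatch on the host side. Here is a minimal standalone sketch of that selection; the names and the three-way split are hypothetical and not the actual ggml-vulkan code:

    #include <cstdint>
    #include <cstdio>

    // Hypothetical variant IDs; in a real backend each would map to a separately
    // compiled pipeline, since a statically declared shared array can't be removed
    // by a specialization constant alone.
    enum class ReduceVariant {
        SubgroupOnly,   // workgroup == one subgroup: reduce in registers, no shared memory declared
        SubgroupShmem,  // several subgroups: subgroup ops plus shared memory for the cross-subgroup step
        ShmemOnly       // fallback when subgroup arithmetic isn't usable
    };

    static ReduceVariant pick_reduce_variant(uint32_t workgroup_size,
                                             uint32_t subgroup_size,
                                             bool subgroup_ops_supported) {
        if (!subgroup_ops_supported) {
            return ReduceVariant::ShmemOnly;
        }
        if (workgroup_size <= subgroup_size) {
            return ReduceVariant::SubgroupOnly;  // avoids declaring shared memory that would go unused
        }
        return ReduceVariant::SubgroupShmem;
    }

    int main() {
        std::printf("%d\n", static_cast<int>(pick_reduce_variant( 32, 32, true)));  // 0: SubgroupOnly
        std::printf("%d\n", static_cast<int>(pick_reduce_variant(128, 32, true)));  // 1: SubgroupShmem
        std::printf("%d\n", static_cast<int>(pick_reduce_variant( 64, 32, false))); // 2: ShmemOnly
    }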

@0cc4m
Collaborator Author

0cc4m commented Aug 18, 2025

@jeffbolznv I unified the subgroup modes and applied the small m optimization to the integer dot shader too, but it just caused a slowdown. In the code it's currently disabled:

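    // mmvq dispatch: currently pinned to the subgroup-sized workgroup variant
    // (DMMV_WG_SIZE_SUBGROUP) instead of the dmmv_wg workgroup-size selection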
    if (b_type == GGML_TYPE_Q8_1) {
        return ctx->device->pipeline_dequant_mul_mat_vec_q8_1_f32[DMMV_WG_SIZE_SUBGROUP][a_type][num_cols-1];
    }

You can replace DMMV_WG_SIZE_SUBGROUP with dmmv_wg to apply your optimization. Do you have any ideas why they don't work together?

@jeffbolznv
Collaborator

I see about a 1% increase across most models by using dmmv_wg on 5090. I think this is in line with what I saw in the original change, but I only tested a couple legacy quant models. It seems to help k quants more.

@jeffbolznv
Collaborator

I've noticed that some models (llama and qwen, at least?) will reuse the same vector for multiple mat muls. If you could reuse the quantization result, this should be a win more often. And this could also benefit some prompt processing cases. I think Q8_0 is the least likely to ever show a benefit for tg, since it's the most bandwidth-limited.

@jeffbolznv
Collaborator

> I've noticed that some models (llama and qwen, at least?) will reuse the same vector for multiple mat muls. If you could reuse the quantization result, this should be a win more often. And this could also benefit some prompt processing cases.

I went ahead and added infrastructure for this in #15410. Should be simple to extend it to handle your new path.
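
As a rough standalone illustration of the reuse idea (the names, cache key, and CPU-side quantize function below are all hypothetical and not the #15410 infrastructure): the quantized copy of an activation vector can be cached and keyed by its source buffer and a version counter, so several mat-vec multiplies against the same vector pay the quantization cost only once.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <unordered_map>
    #include <vector>

    // Hypothetical stand-in for a quantized activation vector (q8_1-like: int8 values + scale).
    struct QuantizedVec {
        std::vector<int8_t> q;
        float scale = 1.0f;
    };

    // Hypothetical CPU-side quantization; in the backend this is a GPU shader dispatch.
    static QuantizedVec quantize(const std::vector<float>& v) {
        QuantizedVec out;
        float amax = 0.0f;
        for (float x : v) amax = std::max(amax, std::fabs(x));
        out.scale = amax > 0.0f ? amax / 127.0f : 1.0f;
        out.q.reserve(v.size());
        for (float x : v) out.q.push_back(static_cast<int8_t>(std::lround(x / out.scale)));
        return out;
    }

    // Cache keyed by (source pointer, version): if the same vector feeds several
    // mat-vec multiplies, only the first lookup pays for the quantization.
    struct QuantCache {
        struct Key {
            const void* src;
            uint64_t    version;
            bool operator==(const Key& o) const { return src == o.src && version == o.version; }
        };
        struct KeyHash {
            size_t operator()(const Key& k) const {
                return std::hash<const void*>()(k.src) ^ (std::hash<uint64_t>()(k.version) * 31);
            }
        };
        std::unordered_map<Key, QuantizedVec, KeyHash> entries;

        const QuantizedVec& get(const std::vector<float>& v, uint64_t version) {
            Key key{v.data(), version};
            auto it = entries.find(key);
            if (it == entries.end()) {
                it = entries.emplace(key, quantize(v)).first;  // quantize once, reuse afterwards
            }
            return it->second;
        }
    };

    int main() {
        std::vector<float> activations(4096, 0.5f);
        QuantCache cache;
        // Two mat-vec multiplies against the same activation vector: one quantization.
        const QuantizedVec& a = cache.get(activations, /*version=*/1);
        const QuantizedVec& b = cache.get(activations, /*version=*/1);
        std::printf("cached entries: %zu, same object: %d\n", cache.entries.size(), &a == &b ? 1 : 0);
    }

The same idea would extend to prompt processing whenever consecutive mat muls share the activation tensor.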

@0cc4m 0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec branch from ec3ec03 to 730ba00 on August 21, 2025 15:48
@0cc4m
Collaborator Author

0cc4m commented Aug 21, 2025

The vector reuse was a good idea; here are updated results:

Nvidia RTX 3090 (without coopmat)

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none

model size params backend ngl fa test t/s (before) t/s (after) diff
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 pp512 2026.20 ± 3.66 2041.47 ± 6.93 +0.8%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 tg128 140.97 ± 0.54 142.77 ± 10.62 +1.3%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 pp512 2035.22 ± 3.19 2036.84 ± 10.87 +0.1%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 tg128 140.02 ± 1.47 146.64 ± 2.76 +4.7%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 1965.03 ± 9.28 1969.27 ± 9.15 +0.2%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 125.02 ± 0.93 132.14 ± 0.91 +5.7%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 1935.37 ± 8.67 1954.69 ± 6.31 +1.0%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 123.14 ± 1.02 131.55 ± 1.38 +6.8%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 1897.59 ± 8.86 1940.74 ± 12.20 +2.3%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 120.19 ± 0.57 126.01 ± 1.06 +4.8%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 1877.20 ± 8.70 1915.73 ± 12.24 +2.1%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 116.09 ± 0.83 122.17 ± 0.92 +5.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 1919.67 ± 4.49 1933.66 ± 8.28 +0.7%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 87.51 ± 0.30 87.53 ± 0.11 +0.0%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 1891.76 ± 13.38 1897.69 ± 12.15 +0.3%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 86.34 ± 0.26 85.98 ± 0.20 -0.4%

AMD Radeon RX 6800 XT

ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

model size params backend ngl fa test t/s (before) t/s (after) diff
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 pp512 1501.11 ± 2.62 1699.58 ± 2.54 +13.2%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 tg128 99.32 ± 0.02 96.59 ± 0.80 -2.7%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 pp512 1479.86 ± 0.78 1674.24 ± 0.82 +13.1%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 tg128 98.23 ± 0.02 95.47 ± 0.01 -2.8%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 1452.99 ± 1.74 1642.41 ± 1.11 +13.0%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 88.33 ± 0.03 89.33 ± 0.01 +1.1%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 1399.49 ± 0.34 1575.58 ± 0.60 +12.6%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 83.51 ± 0.02 84.70 ± 0.03 +1.4%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 1413.64 ± 1.21 1593.95 ± 1.98 +12.8%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 79.98 ± 0.01 82.04 ± 0.01 +2.6%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 1365.12 ± 0.32 1530.70 ± 1.46 +12.1%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 76.02 ± 0.01 77.66 ± 0.02 +2.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 1243.29 ± 0.79 1385.51 ± 1.10 +11.4%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 55.53 ± 0.01 55.40 ± 0.01 -0.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 1206.37 ± 0.50 1338.56 ± 0.56 +11.0%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 52.97 ± 0.00 52.92 ± 0.01 -0.1%

AMD Radeon Pro VII

ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

model size params backend ngl fa test t/s (before) t/s (after) diff
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 pp512 820.04 ± 4.81 903.12 ± 0.77 +10.1%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 tg128 76.05 ± 0.41 99.26 ± 0.12 +30.5%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 pp512 725.11 ± 0.84 776.47 ± 0.96 +7.1%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 tg128 77.94 ± 0.34 100.55 ± 1.10 +29.0%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 723.72 ± 0.64 737.25 ± 1.64 +1.9%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 69.50 ± 0.45 88.40 ± 1.56 +27.2%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 655.13 ± 0.70 666.96 ± 0.93 +1.8%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 66.54 ± 0.31 81.72 ± 0.27 +22.8%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 618.58 ± 0.30 647.37 ± 0.51 +4.7%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 61.02 ± 0.14 89.04 ± 0.30 +45.9%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 568.47 ± 0.38 592.72 ± 0.66 +4.3%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 58.54 ± 0.09 82.60 ± 0.46 +41.1%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 631.94 ± 0.52 646.75 ± 0.76 +2.3%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 58.38 ± 0.25 65.25 ± 0.18 +11.8%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 578.33 ± 0.40 591.40 ± 2.00 +2.3%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 56.35 ± 0.17 62.21 ± 0.12 +10.4%

Intel A770

ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

model size params backend ngl fa test t/s (before) t/s (after) diff
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 pp512 642.87 ± 0.72 744.59 ± 0.69 +15.8%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 tg128 43.70 ± 0.07 46.30 ± 0.05 +5.9%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 pp512 231.35 ± 0.18 242.53 ± 0.17 +4.8%
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 tg128 44.48 ± 0.06 46.95 ± 0.07 +5.6%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 738.90 ± 6.67 825.71 ± 1.69 +11.7%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 37.27 ± 0.07 37.18 ± 0.11 -0.2%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 241.38 ± 0.22 250.03 ± 0.08 +3.6%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 27.89 ± 0.09 27.74 ± 0.06 -0.5%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 740.37 ± 2.37 820.61 ± 1.95 +10.8%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 33.07 ± 0.02 42.39 ± 0.09 +28.2%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 241.59 ± 0.29 251.27 ± 0.21 +4.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 25.51 ± 0.06 30.24 ± 0.13 +18.5%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 659.15 ± 1.29 733.99 ± 1.75 +11.4%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 10.85 ± 0.01 33.89 ± 0.06 +212.4%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 231.58 ± 0.13 241.94 ± 0.27 +4.5%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 9.90 ± 0.03 25.98 ± 0.01 +162.4%

@jeffbolznv
Collaborator

5090 before

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 5 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\glm-4-9b-chat-Q4_0.gguf -m C:\models\GLM-4-32B-0414-Q4_0.gguf -m C:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_1.gguf -m C:\models\Meta-Llama-3-8B.Q5_0.gguf -m C:\models\Meta-Llama-3-8B.Q5_1.gguf -m C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m C:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m C:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        221.08 ± 1.41 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        271.33 ± 2.30 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |        175.83 ± 6.33 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         62.29 ± 0.32 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        355.60 ± 3.89 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       202.62 ± 10.30 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        189.65 ± 7.77 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        174.81 ± 4.38 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        168.11 ± 4.52 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        148.07 ± 2.85 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |        151.36 ± 7.90 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |        244.13 ± 1.13 |

5090 after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 5 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\glm-4-9b-chat-Q4_0.gguf -m C:\models\GLM-4-32B-0414-Q4_0.gguf -m C:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_1.gguf -m C:\models\Meta-Llama-3-8B.Q5_0.gguf -m C:\models\Meta-Llama-3-8B.Q5_1.gguf -m C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m C:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m C:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        227.18 ± 2.28 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        259.18 ± 3.05 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |        179.72 ± 7.38 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         62.74 ± 0.24 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       328.61 ± 24.36 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       206.17 ± 11.95 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        195.52 ± 4.37 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        182.18 ± 4.68 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        173.81 ± 6.15 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        141.36 ± 5.42 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |        147.06 ± 6.01 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |        235.77 ± 0.62 |

It's a bit surprising that llama 3B Q4_0 is so much slower when the other Q4_0s are faster. Might be worth looking into whether this is a general problem with smaller models. But otherwise this looks like a good improvement and seems fine to enable for NVIDIA (but disable for Q8_0).

@0cc4m
Collaborator Author

0cc4m commented Aug 21, 2025

Maybe with small models the overhead from calling the quantization shader stays relatively large, while the improvement from the mmvq shader shrinks. I don't think that's a big problem, but maybe a minimum vector size before mmvq gets enabled would solve it?

I'll probably also add an env variable to disable this path separately from the complete integer dot disable.

It would also be interesting to see how this performs on Nvidia Pascal and Turing, to see if they require different tuning than Ampere+. From the results so far I'd say disable Q8_0 on Nvidia and on AMD RDNA+.
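
For illustration, here is a standalone sketch of what such gating could look like on the host side, combining a minimum problem size, a dedicated opt-out environment variable, and per-vendor exclusion of Q8_0. All names, the environment variable, and the thresholds are invented for this sketch:

    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    enum class Vendor { Nvidia, AmdGcn, AmdRdna, Intel, Other };
    enum class Quant  { Q4_0, Q4_1, Q5_0, Q5_1, Q8_0 };

    // Hypothetical predicate: should the integer-dot mul_mat_vec (mmvq) path be used?
    // k is the dot-product length, n_rows the number of matrix rows.
    static bool use_mmvq(Vendor vendor, Quant type, uint32_t k, uint32_t n_rows) {
        // Dedicated opt-out, separate from disabling integer dot entirely
        // (the variable name is made up for this sketch).
        const char* env = std::getenv("GGML_VK_DISABLE_MMVQ");
        if (env != nullptr && std::strcmp(env, "0") != 0) {
            return false;
        }
        // For small problems the extra quantization dispatch may cost more than it saves
        // (thresholds invented for illustration).
        if (k < 4096 || n_rows < 1024) {
            return false;
        }
        // Q8_0 is the most bandwidth-bound case and regressed on some GPUs in the
        // benchmarks above, so it can be excluded per vendor/architecture.
        if (type == Quant::Q8_0 && (vendor == Vendor::Nvidia || vendor == Vendor::AmdRdna)) {
            return false;
        }
        return true;
    }

    int main() {
        std::printf("%d\n", use_mmvq(Vendor::AmdGcn, Quant::Q4_1, 14336, 4096) ? 1 : 0); // 1
        std::printf("%d\n", use_mmvq(Vendor::Nvidia, Quant::Q8_0, 14336, 4096) ? 1 : 0); // 0
        std::printf("%d\n", use_mmvq(Vendor::Nvidia, Quant::Q4_0,  1024,  128) ? 1 : 0); // 0 (too small)
    }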

@jeffbolznv
Collaborator

> I don't think that's a big problem, but maybe a minimum vector size before mmvq gets enabled would solve it?

Yeah, or maybe also taking the number of rows into account.

> It would also be interesting to see how this performs on Nvidia Pascal and Turing, to see if they require different tuning than Ampere+.

Agreed. My guess is that the DP4 path would be relatively better on those.

@0cc4m 0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec branch from 730ba00 to 0cfc795 on August 24, 2025 17:10
@0cc4m 0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec branch from 0cfc795 to adc8bac on August 31, 2025 10:00
@0cc4m
Collaborator Author

0cc4m commented Aug 31, 2025

@jeffbolznv I disabled Q8_0 on Nvidia, except for the pre_turing architectures, since it was quite good on my P40. I don't have Turing, so I cannot test if it helps there.

I think the PR is ready now.

Here are updated benchmarks:

Nvidia RTX 3090
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 3B Q4_0 1.78 GiB 3.21 B Vulkan 99 0 pp512 7499.61 ± 181.12 7803.81 ± 212.37 +4.1%
llama 3B Q4_0 1.78 GiB 3.21 B Vulkan 99 0 tg128 236.32 ± 29.08 242.68 ± 31.16 +2.7%
llama 3B Q4_0 1.78 GiB 3.21 B Vulkan 99 1 pp512 8216.30 ± 39.15 8576.22 ± 15.12 +4.4%
llama 3B Q4_0 1.78 GiB 3.21 B Vulkan 99 1 tg128 255.82 ± 2.09 264.96 ± 3.34 +3.6%
llama 3B Q8_0 3.18 GiB 3.21 B Vulkan 99 0 pp512 8269.73 ± 163.99 8254.63 ± 113.08 -0.2%
llama 3B Q8_0 3.18 GiB 3.21 B Vulkan 99 0 tg128 173.86 ± 2.60 171.40 ± 1.93 -1.4%
llama 3B Q8_0 3.18 GiB 3.21 B Vulkan 99 1 pp512 9167.98 ± 87.82 9194.90 ± 75.37 +0.3%
llama 3B Q8_0 3.18 GiB 3.21 B Vulkan 99 1 tg128 174.94 ± 0.16 174.41 ± 0.59 -0.3%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 4312.01 ± 195.96 4398.60 ± 16.59 +2.0%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 134.74 ± 0.13 140.61 ± 1.09 +4.4%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 4653.45 ± 6.64 4671.18 ± 4.71 +0.4%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 136.58 ± 0.62 143.00 ± 0.33 +4.7%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 4044.61 ± 51.37 4053.99 ± 34.84 +0.2%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 125.79 ± 0.30 132.25 ± 0.28 +5.1%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 4285.37 ± 3.79 4283.30 ± 11.77 -0.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 128.47 ± 0.22 135.31 ± 0.13 +5.3%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 4388.12 ± 213.24 4468.90 ± 17.43 +1.8%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 91.22 ± 0.06 91.00 ± 0.04 -0.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 4752.80 ± 13.12 4754.22 ± 8.53 +0.0%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 92.64 ± 0.08 92.50 ± 0.09 -0.2%
llama 13B Q4_0 6.60 GiB 12.25 B Vulkan 99 0 pp512 3037.82 ± 6.73 3035.26 ± 4.54 -0.1%
llama 13B Q4_0 6.60 GiB 12.25 B Vulkan 99 0 tg128 93.44 ± 0.13 97.78 ± 0.17 +4.6%
llama 13B Q4_0 6.60 GiB 12.25 B Vulkan 99 1 pp512 3185.39 ± 6.59 3171.11 ± 15.08 -0.4%
llama 13B Q4_0 6.60 GiB 12.25 B Vulkan 99 1 tg128 93.67 ± 0.44 99.13 ± 0.19 +5.8%
llama 13B Q8_0 12.12 GiB 12.25 B Vulkan 99 0 pp512 3113.20 ± 7.07 3093.30 ± 4.23 -0.6%
llama 13B Q8_0 12.12 GiB 12.25 B Vulkan 99 0 tg128 61.51 ± 0.09 61.43 ± 0.04 -0.1%
llama 13B Q8_0 12.12 GiB 12.25 B Vulkan 99 1 pp512 3256.94 ± 17.36 3239.03 ± 25.61 -0.5%
llama 13B Q8_0 12.12 GiB 12.25 B Vulkan 99 1 tg128 62.09 ± 0.04 62.03 ± 0.03 -0.1%
llama 13B Q4_0 12.56 GiB 23.57 B Vulkan 99 0 pp512 1699.52 ± 8.85 1688.18 ± 10.40 -0.7%
llama 13B Q4_0 12.56 GiB 23.57 B Vulkan 99 0 tg128 51.11 ± 0.22 56.98 ± 0.09 +11.5%
llama 13B Q4_0 12.56 GiB 23.57 B Vulkan 99 1 pp512 1743.35 ± 8.75 1735.49 ± 9.92 -0.5%
llama 13B Q4_0 12.56 GiB 23.57 B Vulkan 99 1 tg128 50.61 ± 0.12 57.46 ± 0.03 +13.5%
Nvidia Tesla P40
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 3B Q4_0 1.78 GiB 3.21 B Vulkan 99 0 pp512 823.66 ± 0.20 1090.01 ± 0.82 +32.3%
llama 3B Q4_0 1.78 GiB 3.21 B Vulkan 99 0 tg128 102.57 ± 5.31 105.36 ± 0.14 +2.7%
llama 3B Q4_0 1.78 GiB 3.21 B Vulkan 99 1 pp512 812.53 ± 0.32 1071.82 ± 0.12 +31.9%
llama 3B Q4_0 1.78 GiB 3.21 B Vulkan 99 1 tg128 94.39 ± 0.19 95.08 ± 0.09 +0.7%
llama 3B Q8_0 3.18 GiB 3.21 B Vulkan 99 0 pp512 688.18 ± 0.12 852.24 ± 0.22 +23.8%
llama 3B Q8_0 3.18 GiB 3.21 B Vulkan 99 0 tg128 63.96 ± 0.04 68.56 ± 0.07 +7.2%
llama 3B Q8_0 3.18 GiB 3.21 B Vulkan 99 1 pp512 679.32 ± 0.11 840.11 ± 0.13 +23.7%
llama 3B Q8_0 3.18 GiB 3.21 B Vulkan 99 1 tg128 57.67 ± 0.02 64.08 ± 0.04 +11.1%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 346.90 ± 0.08 469.47 ± 0.10 +35.3%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 51.82 ± 0.06 52.73 ± 0.06 +1.8%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 343.21 ± 0.11 462.19 ± 0.03 +34.7%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 48.93 ± 0.03 49.66 ± 0.04 +1.5%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 323.92 ± 0.05 421.97 ± 0.11 +30.3%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 47.22 ± 0.05 51.03 ± 0.06 +8.1%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 321.14 ± 0.04 417.59 ± 0.16 +30.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 44.82 ± 0.04 48.12 ± 0.05 +7.4%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 281.20 ± 0.03 352.49 ± 0.15 +25.4%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 27.18 ± 0.01 31.45 ± 0.00 +15.7%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 279.13 ± 0.04 349.25 ± 0.14 +25.1%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 25.83 ± 0.05 30.38 ± 0.01 +17.6%
llama 13B Q4_0 6.60 GiB 12.25 B Vulkan 99 0 pp512 240.24 ± 0.03 319.53 ± 0.07 +33.0%
llama 13B Q4_0 6.60 GiB 12.25 B Vulkan 99 0 tg128 34.71 ± 0.06 35.24 ± 0.02 +1.5%
llama 13B Q4_0 6.60 GiB 12.25 B Vulkan 99 1 pp512 237.72 ± 0.03 315.67 ± 0.13 +32.8%
llama 13B Q4_0 6.60 GiB 12.25 B Vulkan 99 1 tg128 32.63 ± 0.01 33.38 ± 0.01 +2.3%
llama 13B Q8_0 12.12 GiB 12.25 B Vulkan 99 0 pp512 197.30 ± 0.16 244.65 ± 0.12 +24.0%
llama 13B Q8_0 12.12 GiB 12.25 B Vulkan 99 0 tg128 17.09 ± 0.01 20.43 ± 0.01 +19.5%
llama 13B Q8_0 12.12 GiB 12.25 B Vulkan 99 1 pp512 195.86 ± 0.02 242.69 ± 0.07 +23.9%
llama 13B Q8_0 12.12 GiB 12.25 B Vulkan 99 1 tg128 16.91 ± 0.01 19.83 ± 0.00 +17.3%
llama 13B Q4_0 12.56 GiB 23.57 B Vulkan 99 0 pp512 122.10 ± 0.02 168.24 ± 0.04 +37.8%
llama 13B Q4_0 12.56 GiB 23.57 B Vulkan 99 0 tg128 18.68 ± 0.01 19.59 ± 0.00 +4.9%
llama 13B Q4_0 12.56 GiB 23.57 B Vulkan 99 1 pp512 121.59 ± 0.02 167.32 ± 0.02 +37.6%
llama 13B Q4_0 12.56 GiB 23.57 B Vulkan 99 1 tg128 18.21 ± 0.01 19.03 ± 0.01 +4.5%
AMD Radeon Pro VII
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 3B Q4_0 1.78 GiB 3.21 B Vulkan 99 0 pp512 1742.18 ± 15.07 1879.22 ± 1.55 +7.9%
llama 3B Q4_0 1.78 GiB 3.21 B Vulkan 99 0 tg128 141.59 ± 0.32 169.23 ± 0.52 +19.5%
llama 3B Q4_0 1.78 GiB 3.21 B Vulkan 99 1 pp512 1518.90 ± 5.66 1606.50 ± 1.21 +5.8%
llama 3B Q4_0 1.78 GiB 3.21 B Vulkan 99 1 tg128 129.17 ± 0.31 150.46 ± 0.29 +16.5%
llama 3B Q8_0 3.18 GiB 3.21 B Vulkan 99 0 pp512 1525.60 ± 20.68 1640.69 ± 1.25 +7.5%
llama 3B Q8_0 3.18 GiB 3.21 B Vulkan 99 0 tg128 127.93 ± 0.21 138.97 ± 0.16 +8.6%
llama 3B Q8_0 3.18 GiB 3.21 B Vulkan 99 1 pp512 1354.12 ± 0.43 1430.52 ± 0.94 +5.6%
llama 3B Q8_0 3.18 GiB 3.21 B Vulkan 99 1 tg128 116.61 ± 0.50 125.84 ± 1.01 +7.9%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 724.63 ± 0.74 741.00 ± 0.28 +2.3%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 73.51 ± 0.27 94.59 ± 0.39 +28.7%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 655.09 ± 0.57 667.93 ± 0.74 +2.0%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 69.84 ± 0.08 87.89 ± 0.57 +25.8%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 619.62 ± 0.59 645.17 ± 0.92 +4.1%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 64.95 ± 0.03 97.76 ± 0.66 +50.5%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 567.91 ± 0.28 590.70 ± 0.17 +4.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 62.11 ± 0.07 89.71 ± 0.96 +44.4%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 635.66 ± 0.50 649.61 ± 0.81 +2.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 59.90 ± 0.08 69.31 ± 1.49 +15.7%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 582.03 ± 0.42 592.61 ± 0.54 +1.8%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 57.48 ± 0.15 64.70 ± 0.12 +12.6%
llama 13B Q4_0 6.60 GiB 12.25 B Vulkan 99 0 pp512 473.67 ± 0.18 484.14 ± 0.36 +2.2%
llama 13B Q4_0 6.60 GiB 12.25 B Vulkan 99 0 tg128 49.59 ± 0.07 63.91 ± 0.79 +28.9%
llama 13B Q4_0 6.60 GiB 12.25 B Vulkan 99 1 pp512 435.77 ± 0.27 444.36 ± 0.74 +2.0%
llama 13B Q4_0 6.60 GiB 12.25 B Vulkan 99 1 tg128 47.28 ± 0.07 59.48 ± 0.09 +25.8%
llama 13B Q8_0 12.12 GiB 12.25 B Vulkan 99 0 pp512 414.73 ± 0.52 425.43 ± 0.33 +2.6%
llama 13B Q8_0 12.12 GiB 12.25 B Vulkan 99 0 tg128 39.02 ± 0.19 45.87 ± 0.05 +17.6%
llama 13B Q8_0 12.12 GiB 12.25 B Vulkan 99 1 pp512 384.33 ± 0.19 394.11 ± 0.16 +2.5%
llama 13B Q8_0 12.12 GiB 12.25 B Vulkan 99 1 tg128 37.11 ± 0.09 43.62 ± 0.02 +17.5%
llama 13B Q4_0 12.56 GiB 23.57 B Vulkan 99 0 pp512 259.01 ± 0.34 272.98 ± 0.30 +5.4%
llama 13B Q4_0 12.56 GiB 23.57 B Vulkan 99 0 tg128 26.56 ± 0.10 35.55 ± 0.04 +33.8%
llama 13B Q4_0 12.56 GiB 23.57 B Vulkan 99 1 pp512 246.23 ± 0.13 259.25 ± 0.11 +5.3%
llama 13B Q4_0 12.56 GiB 23.57 B Vulkan 99 1 tg128 25.59 ± 0.03 34.18 ± 0.01 +33.6%
AMD Radeon RX 6800 XT
| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: | ---: | ---: |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 0 | pp512 | 3128.35 ± 19.48 | 3536.12 ± 10.21 | +13.0% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 0 | tg128 | 177.20 ± 1.14 | 179.01 ± 0.03 | +1.0% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 1 | pp512 | 2970.69 ± 2.46 | 3341.61 ± 3.09 | +12.5% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 1 | tg128 | 162.79 ± 0.04 | 163.95 ± 0.06 | +0.7% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 0 | pp512 | 2647.95 ± 12.80 | 2927.00 ± 11.57 | +10.5% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 0 | tg128 | 116.75 ± 0.01 | 117.17 ± 0.03 | +0.4% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 1 | pp512 | 2537.89 ± 1.17 | 2797.92 ± 2.43 | +10.2% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 1 | tg128 | 110.38 ± 0.02 | 110.70 ± 0.03 | +0.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1472.82 ± 1.49 | 1659.64 ± 1.99 | +12.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 90.91 ± 0.01 | 92.11 ± 0.01 | +1.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1405.20 ± 0.77 | 1577.67 ± 0.55 | +12.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 86.07 ± 0.01 | 87.16 ± 0.01 | +1.3% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1435.66 ± 1.78 | 1618.32 ± 1.97 | +12.7% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 82.42 ± 0.01 | 84.75 ± 0.03 | +2.8% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1373.34 ± 0.10 | 1539.33 ± 1.18 | +12.1% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 78.20 ± 0.01 | 80.01 ± 0.01 | +2.3% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1254.60 ± 1.13 | 1395.16 ± 0.63 | +11.2% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 56.37 ± 0.01 | 56.76 ± 0.01 | +0.7% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1209.21 ± 0.50 | 1336.98 ± 0.44 | +10.6% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 53.76 ± 0.01 | 53.96 ± 0.01 | +0.4% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 0 | pp512 | 963.22 ± 0.73 | 1116.64 ± 1.02 | +15.9% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 0 | tg128 | 62.30 ± 0.01 | 63.10 ± 0.01 | +1.3% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 1 | pp512 | 927.39 ± 0.44 | 1070.51 ± 0.67 | +15.4% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 1 | tg128 | 58.56 ± 0.01 | 59.49 ± 0.01 | +1.6% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 0 | pp512 | 822.15 ± 0.90 | 947.69 ± 0.70 | +15.3% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 0 | tg128 | 37.35 ± 0.00 | 37.44 ± 0.01 | +0.2% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 1 | pp512 | 794.27 ± 0.42 | 911.80 ± 0.35 | +14.8% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 1 | tg128 | 35.84 ± 0.00 | 35.94 ± 0.00 | +0.3% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 0 | pp512 | 471.30 ± 0.61 | 522.83 ± 0.64 | +10.9% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 0 | tg128 | 32.90 ± 0.00 | 33.52 ± 0.00 | +1.9% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 1 | pp512 | 460.56 ± 0.31 | 510.98 ± 0.61 | +10.9% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 1 | tg128 | 31.71 ± 0.00 | 32.32 ± 0.00 | +1.9% |
Intel A770
| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: | ---: | ---: |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 0 | pp512 | 1686.03 ± 1.93 | 1924.54 ± 2.44 | +14.1% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 0 | tg128 | 68.99 ± 0.06 | 72.99 ± 0.05 | +5.8% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 1 | pp512 | 414.63 ± 0.20 | 427.74 ± 0.14 | +3.2% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 1 | tg128 | 44.59 ± 0.02 | 46.27 ± 0.02 | +3.8% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 0 | pp512 | 1452.93 ± 1.89 | 1675.73 ± 3.90 | +15.3% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 0 | tg128 | 23.14 ± 0.03 | 64.00 ± 0.10 | +176.6% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 1 | pp512 | 399.67 ± 0.39 | 414.76 ± 0.12 | +3.8% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 1 | tg128 | 19.37 ± 0.02 | 42.54 ± 0.10 | +119.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 753.08 ± 1.04 | 839.51 ± 1.44 | +11.5% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 37.90 ± 0.02 | 37.98 ± 0.03 | +0.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 240.73 ± 0.06 | 250.99 ± 0.14 | +4.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 28.29 ± 0.04 | 28.40 ± 0.00 | +0.4% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 750.52 ± 2.00 | 830.83 ± 1.36 | +10.7% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 34.23 ± 0.01 | 43.26 ± 0.02 | +26.4% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 242.72 ± 0.09 | 252.55 ± 0.20 | +4.0% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 26.23 ± 0.02 | 30.89 ± 0.05 | +17.8% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 666.56 ± 0.90 | 739.84 ± 1.54 | +11.0% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 10.98 ± 0.01 | 34.76 ± 0.01 | +216.6% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 233.17 ± 0.06 | 243.27 ± 0.15 | +4.3% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 10.01 ± 0.02 | 26.41 ± 0.01 | +163.8% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 0 | pp512 | 507.31 ± 0.71 | 569.83 ± 1.67 | +12.3% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 0 | tg128 | 27.08 ± 0.00 | 25.39 ± 0.03 | -6.2% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 1 | pp512 | 184.43 ± 0.09 | 193.30 ± 0.08 | +4.8% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 1 | tg128 | 20.63 ± 0.04 | 19.81 ± 0.01 | -4.0% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 0 | pp512 | 455.05 ± 0.67 | 511.87 ± 1.19 | +12.5% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 0 | tg128 | 7.28 ± 0.00 | 24.08 ± 0.00 | +230.8% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 1 | pp512 | 176.99 ± 0.10 | 185.07 ± 0.07 | +4.6% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 1 | tg128 | 6.74 ± 0.00 | 18.81 ± 0.00 | +179.1% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 0 | pp512 | 266.15 ± 5.52 | 303.63 ± 3.93 | +14.1% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 0 | tg128 | 16.51 ± 0.00 | 15.77 ± 0.01 | -4.5% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 1 | pp512 | 142.71 ± 0.10 | 153.10 ± 0.05 | +7.3% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 1 | tg128 | 13.89 ± 0.00 | 13.39 ± 0.00 | -3.6% |

@0cc4m 0cc4m marked this pull request as ready for review August 31, 2025 11:29
@jeffbolznv
Collaborator

With the latest version, with my suggested change applied:

5090 before

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 5 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\glm-4-9b-chat-Q4_0.gguf -m C:\models\GLM-4-32B-0414-Q4_0.gguf -m C:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_1.gguf -m C:\models\Meta-Llama-3-8B.Q5_0.gguf -m C:\models\Meta-Llama-3-8B.Q5_1.gguf -m C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m C:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m C:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |       231.26 ± 10.61 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       281.51 ± 10.23 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |        186.76 ± 7.11 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         66.33 ± 0.16 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       370.73 ± 27.35 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       214.39 ± 11.45 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        202.98 ± 6.09 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        183.65 ± 5.42 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        175.93 ± 7.20 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        150.96 ± 8.88 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |        157.93 ± 6.99 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |        253.59 ± 6.56 |

5090 after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 5 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\glm-4-9b-chat-Q4_0.gguf -m C:\models\GLM-4-32B-0414-Q4_0.gguf -m C:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_1.gguf -m C:\models\Meta-Llama-3-8B.Q5_0.gguf -m C:\models\Meta-Llama-3-8B.Q5_1.gguf -m C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m C:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m C:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        242.96 ± 1.15 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        286.09 ± 0.72 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |       189.40 ± 10.97 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         66.55 ± 0.21 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       351.70 ± 22.45 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       219.08 ± 13.79 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       203.03 ± 11.97 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        191.77 ± 7.98 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        181.43 ± 9.14 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        152.61 ± 5.92 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |        161.01 ± 0.67 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |       250.27 ± 13.11 |

4070 before

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 5 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\glm-4-9b-chat-Q4_0.gguf -m C:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_1.gguf -m C:\models\Meta-Llama-3-8B.Q5_0.gguf -m C:\models\Meta-Llama-3-8B.Q5_1.gguf -m C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m C:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m C:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        104.80 ± 0.21 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        117.87 ± 4.63 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |         80.82 ± 1.45 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        189.62 ± 1.04 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         94.73 ± 0.28 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         85.32 ± 4.02 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         79.61 ± 0.61 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         73.85 ± 1.60 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         55.86 ± 0.06 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |         58.35 ± 0.79 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |        117.69 ± 2.41 |

4070 after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 5 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\glm-4-9b-chat-Q4_0.gguf -m C:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_1.gguf -m C:\models\Meta-Llama-3-8B.Q5_0.gguf -m C:\models\Meta-Llama-3-8B.Q5_1.gguf -m C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m C:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m C:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        102.19 ± 4.27 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        119.87 ± 0.09 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |         80.37 ± 1.67 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        185.89 ± 1.24 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         92.58 ± 3.44 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         86.43 ± 1.26 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         79.41 ± 1.77 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         74.85 ± 0.24 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         55.24 ± 1.31 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |         57.89 ± 1.42 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |        117.44 ± 3.40 |

5090 looks good, still the one outlier for the small model but I can live with it. I'm surprised 4070 is so flat.

@0cc4m
Collaborator Author

0cc4m commented Aug 31, 2025

> 5090 looks good, still the one outlier for the small model but I can live with it. I'm surprised 4070 is so flat.

Any ideas why you see a slight reduction on both your cards, but I see a small improvement on 3090? If there's some architectural improvement that makes a difference here in Ada and Blackwell over Ampere, we can try to select mmvq based on that, but if not it might be very hard. I had initially thought it's based on SMs and the 5090 is just very underutilized in the quantize shader, but that makes no sense for the 4070. We can try picking by matrix size, but I think where the threshold lies depends on architecture and maybe SM count, and that would be very hard to pin down.
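
For illustration, a minimal host-side sketch of what such a heuristic could look like; the `device_info` fields, `gpu_arch` buckets and `should_use_mmvq` helper are hypothetical and the thresholds are only placeholders, not the actual ggml-vulkan selection logic:

```cpp
#include <cstdint>

// Hypothetical dispatch heuristic: prefer the integer-dot path (mmvq) only
// where the benchmarks above suggest it wins, otherwise keep the float
// mul_mat_vec path. Not the actual ggml-vulkan code.
enum class gpu_arch { nvidia_ampere, nvidia_ada, nvidia_blackwell, amd_gcn, intel_xe, other };

struct device_info {
    gpu_arch arch;      // assumed field: coarse architecture bucket
    uint32_t sm_count;  // assumed field: number of SMs / CUs / Xe-cores
    bool     int_dot;   // device exposes integer dot product support
};

// m = output rows, k = shared dimension, n = batch size (1 for token generation).
static bool should_use_mmvq(const device_info & dev, uint64_t m, uint64_t k, uint64_t n) {
    if (!dev.int_dot || n > 8) {
        return false;                 // mmvq only targets the small-n (vector) case
    }
    switch (dev.arch) {
        case gpu_arch::amd_gcn:
        case gpu_arch::intel_xe:
        case gpu_arch::nvidia_ampere:
            return true;              // clear or small wins in the results above
        case gpu_arch::nvidia_ada:
        case gpu_arch::nvidia_blackwell:
            // roughly neutral on 4070/5090; a size threshold could gate it, but
            // where that threshold lies likely depends on SM count as well
            return m * k > (uint64_t) dev.sm_count * 512 * 1024;
        default:
            return false;
    }
}
```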

@jeffbolznv
Collaborator

> Any ideas why you see a slight reduction on both your cards, but I see a small improvement on 3090? If there's some architectural improvement that makes a difference here in Ada and Blackwell over Ampere

One of the biggest changes from Ampere to Ada was the much larger L2 size. I don't have a good explanation for why that would matter, though. But I've seen surprising behavior when tuning the rm_kq/etc values that might also be explained by caching effects, so it seems plausible.

@0cc4m
Collaborator Author

0cc4m commented Sep 1, 2025

It seems I broke llvmpipe again; I'll fix it.

@0cc4m 0cc4m merged commit 02c1813 into master Sep 1, 2025
45 of 48 checks passed
@0cc4m 0cc4m deleted the 0cc4m/vulkan-mmq-dp4a-vec branch September 1, 2025 14:19
walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025
…gml-org#14903)

* vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants

* vulkan: use subgroup operations for quantize_q8_1 shader

* vulkan: add q8_1_x4 type with 128-bit alignment, use in mul_mat_vecq shader

* vulkan: use q8_1_x4 blocks in mul_mmq shader

* vulkan: do 8 calculations per invocation instead of 32 in mul_mat_vecq, similar to mul_mat_vec

* vulkan: tune mul_mat_vecq performance for Intel

* vulkan: fix quantizing issue when tensor is not divisible by 128

* vulkan: adapt integer dot mmv to mmv small m optimization (ggml-org#15355)

* vulkan: allow all subgroup modes for mmv and mmvq

* vulkan: use prealloc intermediate reuse for mmvq path

* vulkan: tune mmvq for Intel, AMD GCN and Nvidia RTX 3090

* vulkan: adapt mmv quantize_y path to conditional sync logic

* vulkan: disable q8_0 mmvq on Nvidia

* vulkan: enable q8_0 on Nvidia pre-turing

* fix prealloc sync condition

* fix llvmpipe subgroup 8 issue
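
As a rough CPU-side reference for the core idea behind the commits listed above (quantize the activation vector to q8_1 blocks, then reduce each legacy-quant weight block with an integer dot product), here is a simplified sketch; the struct layout and helper names are illustrative only, and the real GLSL shaders operate on packed 32-bit words via the integer dot product instructions:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Simplified reference of the mul_mat_vecq idea, not the actual ggml structs or shaders:
// each 32-value block of the activation vector becomes int8 values plus a scale d and a
// sum s (q8_1); each q4_1 weight block is then reduced with a pure-integer dot product,
// and the float math is applied once per block.
constexpr int QK = 32;

struct block_q8_1_ref {
    float  d;        // scale: max(|x|) / 127
    float  s;        // sum of the 32 original floats (needed for the q4_1/q5_1 min term)
    int8_t qs[QK];   // quantized values
};

static block_q8_1_ref quantize_q8_1_ref(const float * x) {
    block_q8_1_ref b{};
    float amax = 0.0f, sum = 0.0f;
    for (int i = 0; i < QK; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
        sum += x[i];
    }
    b.d = amax / 127.0f;
    b.s = sum;
    const float id = b.d != 0.0f ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < QK; ++i) {
        b.qs[i] = (int8_t) std::lround(x[i] * id);
    }
    return b;
}

// q4_1 stores 32 4-bit values q in [0,15] reconstructed as d4*q + m4, so
// dot(q4_1, q8_1) ~= d4*d8 * sum(q4_i*q8_i) + m4*s8; the inner sum is the
// integer part that the int-dot instructions accumulate four bytes at a time.
static float dot_q4_1_q8_1_ref(float d4, float m4, const uint8_t * q4 /*32 unpacked nibbles*/,
                               const block_q8_1_ref & b8) {
    int32_t sumi = 0;
    for (int i = 0; i < QK; ++i) {
        sumi += (int32_t) q4[i] * (int32_t) b8.qs[i];
    }
    return d4 * b8.d * (float) sumi + m4 * b8.s;
}
```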